This report provides a plan for evaluating proposed procedures for
implementing the CAT version of the ASVAB and suggests methods to be used,
if CAT is adopted, for checking the utility and operational characteristics
of the actual implementation. The report proposes evaluation of item
content, dimensionality, reliability, validity, item calibration, item
selection and scoring, score equating, and human factors. Some special
problems include omits, speeded tests, and item bias. Suggestions are also
made for exploring ways of taking advantage of the computerized
presentation to get better information from future versions of the ASVAB.
Foreword


In 1980 the Office of Naval Research and the Navy
Personnel Research and Development Center invited a
group of experts in psychometrics to review the current
plans for implementing a computerized adaptive version
of the tests used by the Armed Services for initial
personnel selection and placement. That committee
consisted of R. Darrell Bock, Bert F. Green (chair),
Lloyd Humphreys, Robert L. Linn, and Mark Reckase, with
Charles Davis as the ONR monitor. The committee has
met with members of CATICC, the Computerized Adaptive
Testing Interservice Coordinating Committee, and has
discussed issues with many other leaders in the field.
James McBride, Malcolm Ree, Major Mike Patrow, Hilda
Wing, and Charles Davis of CATICC have been very
helpful. We acknowledge the advice of many colleagues,
especially Huynh Huynh, Michael Levine, Fred Lord,
Melvin Novick, Fumiko Samejima, Hariharan Swaminathan,
James Sympson, Vern Urry, and Thomas Warm. This advice
was sometimes contradictory, but always helpful. The
report that follows is the committee's final
recommendation, based on the literature, the advice of
others, and its own best judgment. The report
represents the committee consensus and is to be taken
as coming from the committee as a whole.


Bert  F.  Green 
R. Darrell Bock
Lloyd  G.  Humphreys 
Robert  L.  Linn 
Mark  D.  Reckase 


I.  Executive  Summary 


The United States Armed Services are planning to introduce
computerized adaptive testing (CAT) into the Armed Services Vocational
Aptitude Battery (ASVAB), which is a major part of the present personnel
assessment procedures. Adaptive testing should improve efficiency greatly
by assessing each candidate's answers as the test progresses and posing
items most appropriate for that candidate, thus avoiding items that are too
easy or too hard. Computer presentation, recording, and scoring of the
ASVAB will improve test security.

This report provides a plan for evaluating proposed procedures for
implementing the CAT version of the ASVAB and suggests methods to be used,
if CAT is adopted, for checking the utility and operational characteristics
of the actual implementation. Suggestions are also made for exploring
ways of taking advantage of the computerized presentation to get better
information from future versions of the ASVAB.

The evaluation plan is based on the assumption of a gradual transition
to CAT, in which both CAT and traditional paper-and-pencil tests (P&P) will
be given. This implies that the CAT version must yield scores that are
essentially equivalent to scores from the P&P version. The plan also
assumes the availability of prototype adaptive testing equipment before
operational implementation.

The role of the computer in adaptive testing is to present each test
question (item) on a display screen, to record and score the response, to
make new estimates of the candidate's ability after each item response, and
to select a next item that will give the best additional information about
the candidate's ability. This procedure requires that each test have a
large pool of items with widely varying difficulty.

The system's estimates of ability and selections of items are based on
a probability model of item responses called item response theory (IRT).
The theory provides a curve for each item, showing the probability of a
correct answer as a function of ability. In the most widely used model the
curve is characterized by three parameters: a, the slope or
discriminability; b, the difficulty; and c, the lowest possible probability
of a correct answer, called the "pseudochance" level. The CAT procedures
discussed here are based on these item parameters.
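
As an illustration of the three-parameter curve, the following minimal
sketch computes the probability of a correct answer from a, b, and c. It
assumes the logistic form of the model without a scaling constant; the
summary above does not specify the functional form, and the function name
p_correct is chosen here only for illustration.

```python
import math


def p_correct(theta, a, b, c):
    """Probability of a correct answer at ability theta.

    a: slope (discriminability), b: difficulty, c: pseudochance level.
    Assumes a logistic curve; this form is an assumption of the sketch.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# A candidate whose ability equals the item difficulty sits halfway
# between the pseudochance level and certainty.
print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.2))
```

For very low ability the curve approaches c rather than zero, which is why
c is described as the lowest possible probability of a correct answer.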
Specific Recommendations


New items will be needed for the CAT versions of the ASVAB. They
should cover the same content as ASVAB forms 11, 12, and 13,
which will be the concurrent P&P tests.

Each test in the battery must measure a single ability or dimension.
Selecting items that are highly discriminating tends to secure
unidimensionality, but this should be verified by other analyses. If more
than one dominant factor is involved in a test, proper operation of CAT
will require that the test be divided into corresponding subtests.

The standard error of measurement at specified score levels is the
best assessment of measurement error in CAT. For some purposes, however,
it may be convenient to define an average coefficient of reliability or a
reliability at specified score levels. Test reliability should also be
assessed empirically by testing a group of examinees twice with different
items and correlating the scores.

The validity of the CAT version must be demonstrated. Validity has
several aspects. Congruent validity requires that the covariance structure
of the tests in the CAT battery match the structure of the standard
paper-and-pencil ASVAB, except possibly for scale. Content validity is
addressed by item content specification. Empirical validity should be
assessed by giving the CAT to persons enrolled in various specialty
schools, and obtaining criterion data on their performance in training. To
the extent that the CAT battery correlates highly with the paper-and-pencil
ASVAB, the validity of the new test battery can be inferred from the
established validity of the present ASVAB.

Several competing methods are available for estimating the parameters
of each item in each test item pool. The stability and accuracy of
estimates obtained by the chosen method should be established empirically
by means of simulation studies incorporating realistic error processes.
Also, since the parameters for CAT items will have to be determined
initially using paper-and-pencil administration, an empirical study is
needed to determine what differences, if any, are caused by the different
modes of presentation. If it proves necessary to calibrate items in
batches, the calibrations will have to be linked. Methods for certifying
the linking procedure are proposed. Finally, the characteristics of each
item pool must be examined to ensure that a reasonable number of highly
discriminating items are available at all ability levels likely to be
encountered.

The estimated ability from CAT is on a scale different from that of
the P&P tests. A method of equating the scales is required. The details
of that method should be reported.

Several possible rules could be used for terminating the testing.
Although every test taker could receive the same number of items, theory
permits testing until a sufficiently small standard error is achieved.
Some compromise rule may be used. The average size of the resulting
measurement error must be checked and reported as a function of test score.


The report discusses several human factors in the equipment design and
testing procedure, including quiet, glare, legibility, response feedback,
and graphics. Immediate display feedback of the chosen alternative is
recommended, together with a separate "verify" button to send the results
on to the computer. The display screen provides a constraint on item
construction: each item must fit on the screen at one time. This
constraint may be a problem with some reading comprehension items.

Two of the ASVAB tests, Numerical Operations and Coding Speed, are
highly speeded and require special treatment. They will not be adaptive,
but will be presented by the computer. The equipment and system design
will affect the norming of these tests, so the tests must be calibrated on
the operational equipment. Provision for accurate measurement of response
times for these items is critical.

On paper-and-pencil tests, candidates may omit items. Experts differ
on whether to permit students to skip or omit items in CAT. It is
recommended that omits not be permitted.

It is important that the test scores and the items not be biased in
favor of any subgroups of persons. Studies are recommended of potential
differential validity of tests for men and women, and for various ethnic
groups. Studies of item bias are also recommended. Since similar studies
have recently been made for the current ASVAB, these studies can await
implementation of the system. The nature of the computer presentation
itself is not expected to favor any one group.

General  Comments  and  Recommendations. 

In  general,  the  procedures  used  in  CAT  should  be  thoroughly  documented 
and  explained. 

Apart from the specific projects proposed to evaluate the CAT version
of the ASVAB, some research projects should be undertaken or supported to
improve aspects of the procedures. CAT methods are still under development
and further development is needed. Multidimensional models, and models
that analyze response option characteristics, should be developed further.

Other ways of using the computerized testing equipment should be
explored to get more information from the ASVAB, and eventually to alter
the ASVAB by including new kinds of measures. Such studies will provide an
additional return on the investment in CAT.


II.  Introduction 


The United States Armed Services are currently considering the
introduction of computer-based technology into procedures for evaluating
the cognitive abilities of personnel. Computer presentation, recording,
and scoring of standardized tests can be expected to make the tests more
secure and the testing more efficient. With computer presentation, the
test can be adapted to each candidate's level of ability, assessing each
candidate's answers as the test progresses and selecting the items that
will give the most additional information about the candidate's ability.
Adaptive computer presentation promises the same accuracy of measurement as
present paper-and-pencil (P&P) tests in much less time. Although present
plans are to devise computer-implemented versions of the existing Armed
Services Vocational Aptitude Battery (ASVAB), computer technology has the
potential for a wider array of assessment measures and will ultimately
provide more effective assignment of personnel.

The introduction of computerized adaptive testing (CAT) in the U.S.
military enlistment procedures would be the first large-scale operational
use of this technology. It is therefore especially important that the
proposed procedure be carefully and thoroughly evaluated. Such an
evaluation can facilitate a fully informed decision about whether to
proceed with the operational implementation. If the decision is positive,
as we expect, then the operational use of CAT should be systematically
monitored and evaluated, to ensure that the new method is working as
expected. Also, further research and development should be done to enable
the Armed Forces to get the maximum benefit from the system.

Computer methods represent a large change in personnel testing.
Methods of evaluating tests must be revised to suit the new procedures.
The concept of validity is still central, but the concepts of reliability
and scale equivalence take on new meanings and must be evaluated
differently. New issues arise in reporting the efficiency and
dependability of an ability test. Consequently the evaluation of a
computerized version of the test must be formulated in its own right, and
not merely taken over from traditional test development practice.

This report is concerned with the evaluation of a computerized
adaptive version of a particular test battery, the Armed Services
Vocational Aptitude Battery (ASVAB). The report discusses the empirical
evidence that will be necessary or highly desirable for establishing
the psychometric suitability of the new version. It considers each of the
major psychometric properties (dimensionality, reliability, validity, and
score calibration) as well as special problems unique to a computerized
adaptive test (item calibration, adaptive selection of items, scoring the
test, and the human factors that affect the use of the computer equipment).
Other problems also addressed are item bias, speededness, and the
calibration of new items in an operational context. Finally, opportunities
for further gains through computer methods of testing are suggested as
topics for further research and development.

Some of the studies proposed are already underway and some of the
procedures endorsed are those that have already been selected. In this
document we make no distinction between what should be done and what already
is being done. Rather, we report on the entire range of psychometric
problems in adaptive testing. Also, recommended methods are based on
current knowledge and may be superseded by new developments.

The committee anticipates that the feasibility of CAT will be
established, and that CAT will be available for implementation. CAT is
expected to provide more efficient use of available testing time and
improved test security. While these are the immediate economic benefits,
CAT also has the potential for many improvements in personnel selection and
placement procedures. The committee urges that the potential economic
benefit of future capabilities also be considered, and that additional work
be supported to develop some of the many possibilities.
Armed Forces Selection Tests

The Armed Services of the United States use standardized tests of
skill and knowledge as part of their personnel recruitment and placement
procedures. The Army General Classification Tests (AGCT) were used
extensively during the Second World War. In 1950, the Armed Forces
Qualification Test (AFQT) was introduced as a replacement for the wartime
tests. The AFQT scale was calibrated to the AGCT scale using a 1944
wartime reference group, although the test was somewhat changed in content.
New forms of the AFQT were introduced in 1953, 1956, and 1960.
Starting in 1972 and continuing through 1975, each service used its own test
battery. However, each battery provided an estimate of an equivalent AFQT
score, as well as other scores. In 1975, the Services again began to
coordinate their selection testing efforts, resulting in a new, expanded
test battery, called the Armed Services Vocational Aptitude Battery
(ASVAB). An early version of this battery had been in use in a high school
recruitment program. New forms were developed, and some content changes
were made for the new forms, ASVAB 5, 6, and 7, which were to be used as
the standard military recruitment tests. Again the scale for the parts of
the ASVAB making up the new AFQT composite was calibrated to the earlier
AFQT scale, although content by now was considerably expanded from the
original tests.

A calibration problem arose with forms 6 and 7 of the ASVAB that
resulted in too many underqualified persons being accepted into the
service. The calibration was studied by several groups within the
Department of Defense, and studies by outside groups were also
commissioned. Each study made a slightly different recommendation for
change. A special outside technical committee (consisting of Robert Linn,
chair, Melvin Novick, and Richard Jaeger) was appointed to evaluate the
studies. They recommended a change in the calibration table that solved
the problem (Jaeger et al., 1980). This episode is mentioned to emphasize
that calibration is a critical aspect of any form of the ASVAB.

In 1978, a study was made of the possibility of applying to the ASVAB
the growing technology in computerized adaptive tests (CAT). Computer
presentation has the potential advantage of improved test security, as well
as simplifying test scoring and reporting. Mainly, though, computer
presentation permits adaptive testing, or "tailored" testing as it is
sometimes called, which can reduce testing time by using testing time more
efficiently. The Computerized Adaptive Testing Interservice Coordinating
Committee (CATICC) was formed to plan for development and implementation of
a CAT version of ASVAB.

At about the same time, the ASVAB was being slightly restructured, and
new forms were being developed. ASVAB forms 8A, 8B, 9A, 9B, 10A, and 10B
became operational in October 1980. New forms of the same test, with no
content change, are now being developed for introduction late in 1983.
These new forms are ASVAB 11A, 11B, 12A, 12B, 13A, and 13B. If the computer
version of the ASVAB is approved, it could be implemented late in 1984, but
would be phased in gradually. Thus the CAT version and the new
paper-and-pencil versions of the ASVAB would have to be as nearly
equivalent as possible. Although each may have its respective norm table, a
candidate should have the same probability of selection and classification
no matter which version of the test is taken.


Table I. Tests of the ASVAB Forms 8, 9, and 10

Tests included in the Armed Forces Qualifying Test composite are enclosed
in dotted lines.

     Tests                              Number of Items   Testing Time (minutes)

  1. General Science                          25                   11
     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
  2. Arithmetic Reasoning                     30                   36
  3. Word Knowledge                           35                   11       AFQT
  4. Paragraph Comprehension                  15                   13
  5. Numerical Operations (speeded)           50                    3
     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
  6. Coding Speed (speeded)                   84                    7
  7. Auto and Shop Information*               25                   11
  8. Mathematics Knowledge                    25                   24
  9. Mechanical Comprehension*                25                   19
 10. Electronics Information*                 20                    9

     Total Questions                         334
     Total Testing Time: 2 hrs. 24 mins.

  *Some test items include diagrams

Currently the ASVAB consists of the ten tests listed in Table I.
Numbers of items and testing times are indicated. These times do not
include the time to read and understand the directions, nor the time for
any other administrative details, including rest periods. When all these
things are taken into consideration, the ASVAB takes about 4 hours to
administer (McBride, private communication).

Except for Tests 5 and 6, which are speeded, the timing has been
established so that most test takers will finish most of the tests. Data
show that, excluding the two speeded tests, each item is attempted by about
98% of the test takers, on the average. The last item on each test is
attempted by 92.2% of the test takers on the average. The raw score on the
test is the number of correct answers. There is no penalty for guessing.

The instructions for the ASVAB say, "Remember, there is only ONE BEST
ANSWER for each question. If you are not sure of the answer, make the BEST
GUESS you can." Each test includes the instruction, "Don't spend too much
time on any one question." (On a recent survey of a national probability
sample of ASVAB takers, about 2/3 said that they did guess; the others said
they did not.)

The use of long time limits on all but the speeded tests makes the
ASVAB a power test, which is good measurement practice when speed is not
being explicitly evaluated. But long time limits raise administrative
problems, since many test takers finish a test long before the time limit,
and are forced to wait idly, with possible adverse effects on anxiety and
motivation. There would be no such waiting with a CAT version, because
each candidate proceeds at his or her own pace.

The raw score on the AFQT is a composite of the raw scores on Tests 2,
3, 4, and 5, with Tests 2, 3, and 4 getting unit weight, and Test 5 getting
1/2 weight. Various branches of the Armed Services use other composites of
their own design for selecting applicants to the Service and for selecting
applicants to clusters of specialty schools (Maier & Grafton, 1981b).
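
The weighting just described can be written out as a short calculation.
This is only a sketch of the raw composite: the function and argument
names are illustrative, and the conversion of the raw composite to the
reported AFQT scale through a calibration table is not shown.

```python
def afqt_raw(arithmetic_reasoning, word_knowledge,
             paragraph_comprehension, numerical_operations):
    """Raw AFQT composite from the raw (number-correct) test scores.

    Tests 2, 3, and 4 get unit weight; Test 5 gets 1/2 weight.
    """
    return (arithmetic_reasoning
            + word_knowledge
            + paragraph_comprehension
            + 0.5 * numerical_operations)


# A perfect paper on the Table I test lengths (30, 35, 15, and 50 items).
print(afqt_raw(30, 35, 15, 50))  # 105.0
```

With the test lengths in Table I, the largest possible raw composite is
therefore 30 + 35 + 15 + 25 = 105.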

ASVAB scores are used to make two kinds of decisions. First, the
scores are used to decide if the candidate is qualified to enlist in his or
her chosen service. At present this decision is based on the AFQT score, a
composite of four of the ASVAB test scores, as described above. However,
each service has a different cut-off score to determine qualification.
None of the services admit persons who are in the lowest 9% of the
reference population (1944 AGCT test takers), and many of those in the next
10% are rejected as well.
A second type of decision that depends on the ASVAB is whether the
person is qualified for a particular specialty school, or a particular set
of specialty courses. Each specialty school, and sometimes each particular
course, has its own entrance criterion, based on a particular combination of
test scores, with a particular cut-off. There are literally hundreds of
different specialties, with different composites and cut-offs. Of course,
admission to certain advanced schools requires not only certain test scores
but also successful completion of earlier training programs.

With decisions being made in so many diverse ways, it is not possible
to focus attention too closely on any one test score. Even the first-level
decision is based on a composite of several test scores, the AFQT. It will
therefore be necessary to provide scores that have good accuracy at all
score levels.

The two speeded tests are Numerical Operations and Coding Speed.
Special methods are needed for computer versions of these tests. The
precise nature of the response, i.e., the human factors component, is
critical in these tests.

The Navy Personnel Research and Development Center (NPRDC) now has an
experimental installation where computerized tests are given to Armed
Forces personnel, as a part of test research and development. In addition,
a new experimental facility is being set up with an array of computer-driven
terminals to test computerized versions of the ASVAB as they become
available. NPRDC plans to begin preliminary evaluation of a CAT battery in
1982. Three of the ASVAB tests use elaborate diagrams and drawings: Auto
and Shop Information, Mechanical Comprehension, and Electronics
Information. Although the current experimental facility at NPRDC does not
now have the capability for graphical items, the prototype system and
operational systems are intended to include a graphics capability.

The Air Force Human Resources Laboratory has contracted for the
development, pretesting, and statistical analysis of an experimental pool of
200 items for each of the ten ASVAB tests. In addition, operational item
pools are currently being developed. Plans call for 200 items in each
operational test pool.

Some evaluative work on CAT is already under way. Some of the studies
recommended in the present report are already planned or are in progress.
The present report may lead to some change of detail in those studies, and
in any event puts them in the context of the overall evaluation, and
supports their execution.
One large-scale study of the current paper-and-pencil ASVAB has
recently been completed; reports of data analyses will be issued as soon as
possible. The Profile of American Youth is a national longitudinal study,
in which a carefully designed sample of persons in the United States aged
16-23 took form 8A of the ASVAB, so that national norms could be
constructed. Considerable additional information about the ASVAB can be
obtained from this excellent data base (Department of Defense, 1982).

The ASVAB technical manual (Wilfong, 1980) contains much useful
information. Past history was culled from a review by the ASVAB Working
Group (1980). See also Department of Defense, 1980.
Adaptive Testing


The principal idea of adaptive testing is simply that each test taker
is asked questions that are appropriate for his or her level of skill or
ability. It is inefficient to ask questions that are too easy or too
difficult for the candidate, since those responses contribute very little
information about that person's ability. The terms adaptive testing and
tailored testing will be used as synonyms in this report.

The method of adaptive testing has roots in early psychological
measurement. Psychophysicists, beginning with Wundt, determined sensory
thresholds by presenting stimuli at varying intensities according to the
observer's ability to sense them. Binet (1909), the father of mental
testing, asked each child questions appropriate to the child's age, and
moved up or down the age scale depending on the child's answers. The
process of choosing items appropriate to the child's mental ability can be
viewed as fitting the test to the test-taker, hence the term tailored
testing. Such a procedure is very difficult to manage if people are tested
in groups rather than one at a time, so ordinary pencil-and-paper (P&P)
tests present the same items to all test-takers. The items on group tests
vary in difficulty over a range appropriate to the population being tested,
so group tests are roughly matched to the population, but cannot be
tailored to the individuals.

Several attempts have been made to approximate a tailored test using a
P&P mode. Lord (1971) proposed a flexilevel test in which items were
ordered in difficulty. Everyone started with the item of median
difficulty. A special answer sheet revealed whether a response was correct
or incorrect. Whenever a candidate answered an item correctly he tried the
next harder item that he had not already tried, and whenever he got an item
wrong, he tried the next easier one.
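
The branching rule of the flexilevel test can be sketched in a few lines.
The sketch below is illustrative only: the function name, the callable
that reports whether an item is answered correctly, and the fixed number
of items to give are assumptions, since the stopping rule is not described
above.

```python
def flexilevel_sequence(n_items, answers_correctly, n_to_give):
    """Return the indices of the items tried, with 0 the easiest item.

    answers_correctly(i) is a hypothetical callable that reports whether
    the candidate gets item i right. Stopping after n_to_give items is an
    assumption of this sketch.
    """
    current = n_items // 2              # start at the median difficulty
    next_harder = current + 1           # hardest-side item not yet tried
    next_easier = current - 1           # easiest-side item not yet tried
    tried = []
    while len(tried) < n_to_give:
        tried.append(current)
        if answers_correctly(current):
            if next_harder >= n_items:  # no harder item is left
                break
            current, next_harder = next_harder, next_harder + 1
        else:
            if next_easier < 0:         # no easier item is left
                break
            current, next_easier = next_easier, next_easier - 1
    return tried


# A candidate who can answer items 0 through 5 of a nine-item test.
print(flexilevel_sequence(9, lambda i: i <= 5, 5))  # [4, 5, 6, 3, 7]
```

Because the harder pointer only moves up and the easier pointer only moves
down, no item is presented twice.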

Another procedure consists of a routing test followed by a second test
selected from a series of tests that are graded in difficulty (Lord, 1971).
The score on the routing test indicates which second test a candidate is to
take. Both schemes are more efficient than a conventional test. But both
schemes are cumbersome, and neither gains the full efficiency possible
with an individually tailored test. An experimental comparison of these
procedures has been made by Friedman, Steinberg, and Ree (1981).

With a digital computer to present the test items, item-by-item
adaptive testing becomes feasible. The computer can score each response
immediately and can then select the next item that will be most appropriate
for the candidate. Each candidate gets a set of items uniquely selected
for him or her. More specifically, each person's first item generally has
about medium difficulty for the total population. Those who answer
correctly generally get a harder item; those who answer incorrectly get an
easier item. After each response, the examinee's ability is estimated,
along with an indication of the accuracy of the estimate. The next item to
be posed is one that will be especially informative for a person of the
estimated ability, which generally means an item for which the probability
of a correct response, at that ability level, is in the neighborhood of
.65. Normally, the process results in harder questions being posed after
correct answers and easier questions after incorrect answers. Ideally, the
change in item difficulty from step to step is usually larger earlier in
the sequence when less is known about the candidate's ability, but later in
the sequence the difficulty changes less radically as the system tries to
refine its estimate of the candidate's ability. The process continues
until there is enough information to place the person on the ability scale
with a specified level of accuracy, or until some more pragmatic criterion
is achieved. If desired, each candidate's score on a CAT can be estimated
to the same level of accuracy. By contrast, high and low scores on a group
test are typically less accurate than scores near the mean.

A CAT consists of a set of items, called an item pool or item bank, from
which particular items are selected for presentation to the candidate. The
precision of the CAT depends on the characteristics of the items in the
pool. If the pool is not large enough, or is not well matched to the ability
distribution of the group being tested, the advantages of an adaptive test
will not be fully realized. If, for example, the adaptive procedure
indicates that the next item for a particular person should be moderately
easy, but there are no more moderately easy items, the system will have to
settle for an item that is very easy, or for one that is moderately
difficult, with the result that less information will be obtained than if an
appropriate item had been available. Thus adaptive testing requires a
sufficient supply of items at each ability level. If security
considerations suggest that the items be varied, this implies a need for
several alternatives at each ability level, so large item pools are needed
for adaptive tests.

Adaptive testing places new demands on psychometric test theory and method.
Classical test theory is not adequate; methods appropriate for group tests
will not work with adaptive tests. The most obvious problem is that the test
score can no longer be the number of items answered correctly. In an ideal
tailored test, after the first few items, everyone will tend to answer about
the same number of items correctly. The score must depend in some way on the
characteristics of the items answered correctly.

Also, the indices commonly used to judge the quality of the items are less
appropriate. The standard index of item difficulty is the proportion of
persons answering the item correctly, which is dependent on the population
of test takers. Likewise, the standard indices of item discriminating power,
such as the item-test correlation, are also dependent on the population.

Finally, adaptive tests place more stringent demands on the test items in
the item pool. Adaptive tests are presently designed to work with items all
measuring a single aspect or dimension of ability. Adaptive testing is based
on the notion of items and people placed along a single scale of ability.
Unidimensionality of the test items is therefore central. Although adaptive
methods may eventually be developed for multidimensional test domains,
present procedures expect a single dimension. When a test has one strong
dimension, but several facets, as when verbal skill is measured by different
types of items (antonyms, analogies, etc.), then special precautions are
needed in an adaptive environment to balance the facets. This issue is
discussed in more detail in the body of the report.

Some concern has been expressed about possible legal challenges to the
equity of adaptive testing. The fact that the candidates do not take the
same items might be interpreted to mean that they do not all take the same
test. CAT might be challenged for not permitting some candidates to display
their ability, because it does not give them the opportunity to answer the
more difficult items. Such a challenge may possibly be raised, but it seems
to us to be without merit. At present, not all candidates take the same P&P
test form. In most testing programs there are several different test forms,
all calibrated so as to be equivalent. The questions differ, but the area of
skill or knowledge assessed is the same on all test forms, and every
candidate has the same opportunity. In the same way, every candidate's
encounter with the CAT form of the test offers the equivalent opportunity.
Indeed, one of the overriding considerations in the evaluation for CAT
recommended in the present report is the assurance of equivalence, so that
each candidate does have the same fair chance.

It should be noted that the concept of fairness involves equal opportunity,
not equal treatment. In a track-and-field meet, each competitor must have
the same chance at the high jump, but fairness does not require that a
person who can't clear a six-foot high jump nevertheless be given a chance
at seven feet. The point is to see how high each person can jump, not to
permit each person license to try all levels. In a tennis tournament, it is
not considered necessary for every player to play the best players - only
that every player have the same initial chance. In the same way, a CAT
provides every candidate the same initial opportunity. Further, those who
fail the first two or three items can still get a good score if they pass
all the subsequent items. CAT continually gives each candidate additional
chances. No fairness is lost by not asking a candidate questions that are
too easy or too difficult. Indeed, by providing more accuracy for high and
low scores, the test is potentially more fair.

Early work on adaptive testing is discussed in Harman et al. (1968),
Holtzman (1970), and Wood (1973). More recent accounts can be found in
U.S. Civil Service Commission (1976) and Weiss (1974, 1978, 1980).
Applications have been discussed by Urry (1977), Lord (1977a, b), and
Kreitzberg & Jones (1980).
Item Response Theory

Classical test theory is not suited to adaptive tests. Classical theory
supposes that all test-takers confront the same set of test items, as in
the conventional P&P tests. Classical indices of reliability, validity, and
item quality are relevant to a particular set of items and a particular
population of test-takers. But an adaptive test is different for each
taker, and is, in principle, independent of the particular population.

A theory that is appropriate for adaptive tests was developed by Rasch
(1960), Lawley (1943), Tucker (1946), Lord (1952), Samejima (1969), Owen
(1975), and others. This new theory, now called item response theory (IRT),
was discussed by Birnbaum (1958) as latent trait theory, and appears in
Lord & Novick's (1968) major treatise on test theory. Hambleton & Cook
(1977) and Warm (1978) give good introductions. More complete accounts of
IRT have been given recently by Lord (1980), Urry & Dorans (1980), and Urry
(1981).*

The theory postulates that persons vary in the ability being assessed by
the test, and that their abilities are distributed along a continuum
labelled $\theta$, from low to high. Each person has a particular ability
level; the ability of person $i$ is $\theta_i$. The probability of answering
an item correctly is assumed to vary with ability, symbolized for item $j$
by $P_j(\theta)$. The model assumes a particular form for this probability
function. The traditional choice is the cumulative normal function (ogive),
but the cumulative logistic curve is essentially indistinguishable from the
cumulative normal, and is mathematically convenient. Its mathematical form,
shown by Items 1 and 2 in Fig. 1, is

$$P_j(\theta_i) = 1 / [1 + e^{-z_{ji}}] \qquad (1)$$

where

$$z_{ji} = 1.7 a_j (\theta_i - b_j),$$

and $a_j$ (discrimination) and $b_j$ (difficulty) are the two parameters of
item $j$.


* The term "latent trait theory" is used in the earlier literature, rather
than "item response theory." "Latent" signifies that the ability or skill
being assessed is inferred from the item responses, and is in this sense
latent in the item responses; "trait" merely refers to a characteristic of
the examinee that is sufficiently stable to be measured. However, some
laypersons may interpret the term "latent trait" in a non-technical sense
as implying a fixed, inherited property of the individual not alterable by
training. This interpretation is incorrect, and is in no way appropriate
to tests of vocational skills and knowledge, so the neutral phrase "item
response theory" is preferred.


Figure 1. Illustrative Item Response Curves. (Horizontal axis: Ability
Level, $\theta$.) Item 1 is more discriminating than Item 2. Item 3
includes guessing.
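As an illustration only, the logistic response curve can be sketched in a few lines of Python. The function name `irc_2pl` and the parameter values for the two items are hypothetical choices made for this sketch; they are not the items plotted in Figure 1.

```python
import math

def irc_2pl(theta, a, b):
    """Two-parameter logistic item response curve with the 1.7 scaling constant."""
    z = 1.7 * a * (theta - b)
    return 1.0 / (1.0 + math.exp(-z))

# Two hypothetical items of equal difficulty (b = 0); the first has the
# larger a and is therefore the more discriminating of the two.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    steep = irc_2pl(theta, a=1.5, b=0.0)
    shallow = irc_2pl(theta, a=0.7, b=0.0)
    print(theta, round(steep, 3), round(shallow, 3))
```

Both curves pass through .5 at the item's difficulty; the item with the larger a rises more sharply around that point.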
So far, the model is not novel, but is simply borrowed from psychophysics,
where $P_j(\theta)$ is the probability of detecting the presence of some
stimulus, or from biological assay, where $P_j(\theta)$ is the proportion of
samples that exhibit some property. What is special about the test theory
context is that the ability values, the $\theta_i$'s, are unobserved, and
indeed are unobservable. The nature of the $\theta$ variable is in part
determined by the assumption that the response curve of each item has the
form of Equation (1), varying only in $a_j$ and $b_j$. The nature of
$\theta$ is further specified by the fundamental assumption of the model
that, for a fixed value of $\theta$, responses to the items are independent.
Thus, the probability that a person of ability $\theta$ answers both items
$j$ and $k$ correctly is simply

$$P_j(\theta) \times P_k(\theta).$$

This is called the assumption of local independence. It means, in essence,
that the item responses are related to each other only because they are all
related to the ability scale, $\theta$. The source of the interitem
relationship is the underlying ability $\theta$, which is being measured by
the items. This assumption is fundamental to many models of individual
differences, including common factor analysis and latent structure analysis.
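Local independence is what makes the probability of a whole response pattern easy to write down: at a fixed ability it is a product over items. The sketch below is illustrative only; the function names and the item parameters are assumptions made for the example.

```python
import math
from itertools import product

def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def pattern_probability(theta, items, responses):
    """Probability of a 0/1 response pattern at fixed theta under local independence."""
    prob = 1.0
    for (a, b), u in zip(items, responses):
        p = p_correct(theta, a, b)
        prob *= p if u == 1 else (1.0 - p)
    return prob

# Three hypothetical items given as (a, b) pairs.
items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.7)]
both_right = pattern_probability(0.0, items[:2], [1, 1])
print(round(both_right, 4))

# The probabilities of all eight possible patterns sum to one.
total = sum(pattern_probability(0.0, items, u) for u in product((0, 1), repeat=3))
print(round(total, 6))
```

The same product, viewed as a function of ability for an observed pattern, is the likelihood on which ability estimation rests.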

The model described above is frequently called the two-parameter model
because each item response curve has two parameters, $a_j$ and $b_j$. A
simpler, one-parameter version of the model has many attractive features;
it is obtained by assuming that all items can be treated as being equally
discriminating, so that $a_j = a$ for all $j$. The resulting model, also
called the Rasch model, has been advocated by Andersen (1973), Fischer
(1973), and Wright (1977). Unfortunately this model does not fit most data.
Items are not equally discriminating, and the inequality matters. Koch &
Reckase (1978) and Patience & Reckase (1979) showed that the more
complicated models performed better than the Rasch model. The simple model
does not take differences in item discrimination into account when
selecting items to present.

Neither model is adequate for multiple-choice items, in which the item may
be answered correctly by chance. Some test-takers guess when they don't know
the right answer, and sometimes they are lucky. Because of guessing, the
probability of correctly answering the item does not necessarily decrease to
zero for persons of very low ability, but may decrease to some minimum
level, often called the pseudo-chance level. The pseudo-chance level for an
item becomes the third parameter of the item, $c_j$.

In the three-parameter model, the logistic item response curve, indicating
the probability of a correct response to item $j$, becomes

$$P_j(\theta_i) = c_j + (1 - c_j) / [1 + e^{-z_{ji}}]$$

where, as before,

$$z_{ji} = 1.7 a_j (\theta_i - b_j).$$

Item 3 in Figure 1 has such a response curve. For very low values of
$\theta$,

$$P_j(\theta) \approx c_j.$$

As $\theta$ increases, the probability rises from $c_j$ to 1, in the same
way that it rose from 0 to 1 in the earlier model that does not include the
$c_j$ parameter.
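A minimal sketch of the three-parameter curve follows, assuming the standard logistic form with a lower asymptote at the pseudo-chance level. The function name `irc_3pl` and the parameter values are illustrative assumptions.

```python
import math

def irc_3pl(theta, a, b, c):
    """Three-parameter logistic curve: lower asymptote c, upper asymptote 1."""
    z = 1.7 * a * (theta - b)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Hypothetical item with a = 1.0, b = 0.0 and pseudo-chance level c = 0.2.
for theta in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(theta, round(irc_3pl(theta, 1.0, 0.0, 0.2), 4))
```

At the item's difficulty the probability is halfway between c and 1, rather than .5 as in the two-parameter model.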

It might be supposed that for four-option items like those on many tests,
$c_j$ would be about .25. However, it is often found that $c_j$ is less than
would logically be expected if wrong answers were random guesses. Not all
examinees guess when they do not know the correct answer, and wrong answers
may be due more to misinformation or incomplete information than to
guessing. One study shows that on some 4-alternative multiple choice tests,
the $c$ parameter varies from .10 to .35 or more, with a median of about .20
to .25. Another study finds that if all item response curves for a similar
test are forced to have the same $c$ value, a value of .10 is best (Bock &
Mislevy, 1981). Adding the third parameter, $c_j$, complicates item response
theory enormously, and it would be an immense convenience to leave it out.
Nevertheless, the three-parameter model is needed. A model without $c_j$
does not fit multiple choice items well when $c_j > 0$.
The classical theory of statistical estimation provides a powerful way of
describing the amount of information in an item, and in a test. Test
information is inversely related to the variance of measurement error.
(Bayesian theory provides an equivalent result.) The relative amount of
information that an item provides about persons of various abilities is
called the item information function. It is given by

$$I_j(\theta_i) = [P'_j(\theta_i)]^2 / [P_j(\theta_i) Q_j(\theta_i)]$$

where

$$Q_j(\theta_i) = 1 - P_j(\theta_i)$$

and $P'_j(\theta_i)$ is the slope (first derivative) of the item response
curve at $\theta_i$.

It can be shown that maximum information occurs where the curve is
steepest, which is in the vicinity of $b_j$, and that this information is
proportional to $a_j^2$. This means, first, that $a_j$ is indeed an index of
discrimination, and second, that it is best to use items with $b$-values
near to a person's ability. (The specific location of the maximum, and the
specific information at that point, depend in a complex way on $c_j$.)
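The properties claimed for the item information function can be checked numerically. The sketch below assumes the three-parameter logistic curve, for which the slope has a closed form; the function name and the parameter values are illustrative assumptions.

```python
import math

D = 1.7  # scaling constant of the logistic model

def item_information(theta, a, b, c=0.0):
    """Item information [P']^2 / (P Q) for the three-parameter logistic curve."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    slope = D * a * (1.0 - c) * logistic * (1.0 - logistic)  # dP/dtheta
    return slope ** 2 / (p * (1.0 - p))

# Without guessing, the maximum is at theta = b and grows with a squared.
print(round(item_information(0.0, a=1.0, b=0.0), 4))
print(round(item_information(0.0, a=2.0, b=0.0), 4))

# With guessing (c = 0.25, a hypothetical value), information is lower and
# the maximum moves somewhat above b.
for theta in (-0.2, 0.0, 0.2, 0.4):
    print(theta, round(item_information(theta, a=1.0, b=0.0, c=0.25), 4))
```

Doubling a quadruples the peak information when c is zero, which is the sense in which information is proportional to the square of the discrimination parameter.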

For a test of fixed items, the information function of the test is simply
the sum of the information functions of the individual items. It is easy to
see that for a fixed-item test to yield a reasonable amount of information
about persons of a wide range of ability, the item difficulties must span a
comparable range, with the result of not providing very much information
anywhere. In practice a compromise is usually struck. Tests are constructed
with not many very easy items, nor many very hard items. The information
curve for a test of Arithmetic Reasoning, shown in Figure 2, is typical of
many standard tests. The height of the test information function shows the
relative precision with which test scores are measured. Figure 2 shows that
low scores and high scores are not measured very precisely. The reciprocal
square root of the test information function is asymptotically proportional
to the width of the confidence interval for estimating $\theta$ from the
item responses, which in turn is proportional to the standard deviation of
the measurement errors for each fixed true ability level.

The relationship between ability defined by the item response model and the
number-right score is in general non-linear. Figure 3 sketches the relation
of the ability scale, $\theta$, and the expected number-right score for a
typical paper-and-pencil test. The relationship is nearly linear in the
middle of the range, but is curvilinear at the extremes.
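The expected number-right score at a given ability is the sum of the item response curves evaluated at that ability. The following sketch, with a hypothetical nine-item test and the three-parameter logistic curve assumed, shows the flattening at the extremes.

```python
import math

def irc_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def expected_number_right(theta, items):
    """Expected number-right score: sum of the item response curves at theta."""
    return sum(irc_3pl(theta, a, b, c) for a, b, c in items)

# Hypothetical test: difficulties from -1 to +1, a = 1.0, c = 0.2.
items = [(1.0, -1.0 + 0.25 * k, 0.2) for k in range(9)]
for theta in (-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0):
    print(theta, round(expected_number_right(theta, items), 2))
```

A unit change in ability moves the expected score far more in the middle of the range than near either end, where the score approaches its floor (the sum of the pseudo-chance levels) or its ceiling (the number of items).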

For an adaptive test, we can characterize the entire pool of available
items by the information function of the pool, which is the sum of the
individual item information functions for all the items. This information
function would be a much broader function than that in Figure 2. But the
important issue is the information function for the items given to a
particular candidate. In adaptive testing, the tailoring process chooses
from this pool so that the information function for each candidate will be
maximum in the vicinity of that candidate's ability level, $\theta_i$. Thus,
with an adaptive test, either the precision of measurement is greatly
improved, if the number of items is not changed, or a given level of
precision can be achieved with a much smaller number of items than would be
possible with a standard fixed-item test. (Bayesian theory provides a
slightly different analysis but reaches the same conclusions.)

It is important to recall that at the start of the testing process we know
little or nothing about the candidate's ability level. Consequently, in a
tailored test, the first item presented is one that is appropriate for the
average candidate. (Performance on any previous test in the battery may be
used to improve the initial choice.) After each item response, an improved
estimate can be made of the candidate's ability, and more appropriate items
selected for presentation. At each stage of the process we have not only an
estimate of the ability of the candidate but also an estimate of the
standard error of the estimate, so we know how good our current estimate
is. We may stop when this confidence interval becomes narrow enough, or we
can stop after a fixed number of items, chosen so that, on the average, the
level of precision is acceptable.
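The cycle just described (present an item, update the ability estimate and its standard error, select the next item, stop on a precision or length criterion) can be sketched as a small simulation. Everything specific here is an assumption made for illustration: the three-parameter logistic model, a standard normal prior with a posterior-mean estimate computed on a grid, the precision target of 0.35, the 20-item cap, and the randomly generated pool. It is not the procedure proposed for the ASVAB.

```python
import math
import random

D = 1.7
GRID = [-4.0 + 0.1 * k for k in range(81)]  # ability grid for the estimate

def p3(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info(theta, a, b, c):
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    slope = D * a * (1.0 - c) * logistic * (1.0 - logistic)
    return slope ** 2 / (p * (1.0 - p))

def estimate(administered, responses):
    """Posterior mean and standard deviation of ability (normal prior assumed)."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for (a, b, c), u in zip(administered, responses):
            p = p3(t, a, b, c)
            w *= p if u else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(pool, answer, se_target=0.35, max_items=20):
    remaining = list(pool)
    administered, responses = [], []
    theta, se = 0.0, 1.0  # start at the average candidate
    while remaining and len(administered) < max_items and se > se_target:
        item = max(remaining, key=lambda it: info(theta, *it))  # most informative
        remaining.remove(item)
        administered.append(item)
        responses.append(answer(item))
        theta, se = estimate(administered, responses)
    return theta, se, len(administered)

# Simulated examinee with a hypothetical true ability of 1.0.
rng = random.Random(0)
pool = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0), 0.2) for _ in range(200)]
true_theta = 1.0
theta_hat, se, n = run_cat(pool, lambda it: rng.random() < p3(true_theta, *it))
print(round(theta_hat, 2), round(se, 2), n)
```

The loop stops on whichever comes first, the precision target or the item cap, which corresponds to the two stopping options mentioned above. A maximum likelihood estimate could be substituted for the posterior mean without changing the structure of the loop.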

In adaptive testing, the estimate of ability and the choice of the next
item require knowledge of the parameters of the item response curves - the
a's, b's, and c's. Estimates of these values must have been determined
before the testing process is begun. This is usually done by giving all of
the items to comparable, large samples of candidates, in a standard testing
situation. If there are too many items for this to be practical, then
overlapping subsets of items can be given to several different samples of
candidates. Methods are then available for linking the estimates of item
parameters. There is a large literature on parameter estimation; see for
example Reckase (1978), Ree (1981), and Yen (1981).

The step of determining the item parameters in advance is also a part of
conventional testing, where item difficulty and item discrimination indices
are obtained from pretest data. But these values are used in a somewhat
informal way in constructing a conventional test, whereas the item
parameters are a central part of the adaptive testing process.

Experts agree on the general outline of the above process, but disagree
about details. Most experts advocate using the three-parameter model for
multiple choice tests, although some advocate the one-parameter logistic
model (Wright, 1977). Some experts advocate using "maximum likelihood"
estimators of ability, and corresponding estimates of item parameters. Lord
(1980) and Samejima (1977a, b) are the chief advocates of this position;
Wood, Wingersky & Lord (1976) have authored a widely used computer program,
LOGIST, to compute item parameters. Others, including Urry & Dorans (1980),
advocate Bayesian methods, in which an initial prior estimate of ability is
made, together with a guess about the ability distribution, and the item
response data are used to improve the initial estimates by Bayes's theorem.
Urry (1981) has prepared computer programs, called OGIVIA and ANCILLES, to
provide Bayesian estimates of item parameters. (McKinley & Reckase (1980,
1981a) give a comparison of ANCILLES and LOGIST.) Bock & Aitkin (1981)
advocate a marginal maximum likelihood approach to estimating item
parameters, and Bock & Mislevy (1981) offer a program called BILOG based on
this method. Bock advocates empirical Bayes estimation of ability in the
tailored testing situation, possibly with reduced weighting of extreme
responses.

The experts also disagree on the best procedures for selecting successive
items in the tailoring process, and on the criterion for stopping the
process. From a purely theoretical perspective, the next item to be given
to a person should be the most informative item, as judged by the current
estimate of the person's ability. Slavish following of that rule is likely
to result in a few of the very best items (those with the largest $a_j$'s)
being used a great deal, with other items in the pool possibly under-used,
which may jeopardize test security. In practice, there will usually be many
items that are almost as good as the best in any situation, and it may make
very little practical difference which one is selected from among these
possible items. NPRDC is now conducting a series of computer simulations of
adaptive testing to evaluate the psychometric effects of alternative
adaptive testing procedures, including random choice of test items in the
vicinity of the current ability estimate of a candidate.
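One simple way to spread item usage, offered here only as an illustrative sketch and not as the procedure under study at NPRDC, is to choose at random among the few most informative items at the current ability estimate. The function names, the value k = 5, and the randomly generated pool are assumptions for the example.

```python
import math
import random

D = 1.7

def item_information(theta, a, b, c):
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    slope = D * a * (1.0 - c) * logistic * (1.0 - logistic)
    return slope ** 2 / (p * (1.0 - p))

def pick_item(theta, pool, k=5, rng=random):
    """Pick at random among the k most informative items at theta."""
    ranked = sorted(pool, key=lambda item: item_information(theta, *item),
                    reverse=True)
    return rng.choice(ranked[:k])

# Hypothetical pool of (a, b, c) triples.
rng = random.Random(1)
pool = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0), 0.2) for _ in range(50)]
chosen = pick_item(0.0, pool, k=5, rng=rng)
print(chosen)
```

Setting k = 1 recovers the strict maximum-information rule; larger k trades a little information per item for more even exposure across the pool.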

Also, most experts advocate continuing the testing process until a
predetermined level of accuracy of the test score (the estimated ability)
is reached. Others feel that little may be lost in practice if the same
number of items is given to each test taker. A compromise may be to aim for
a predetermined accuracy, but to place an upper limit on the number of
items to be given.


In a field as new as computerized adaptive testing, there are sure to be
disagreements among experts. The committee members themselves are not in
complete agreement about all aspects of the evaluation. Perhaps it would be
more nearly accurate to say that there are several issues about which there
simply is not yet enough information for a decisive answer. But the
committee is in complete agreement that the proposed procedures are
satisfactory, and that the computerized version of the ASVAB can be as good
as the present version, if not better. The proposed studies will indicate
whether in fact the computerized version does live up to its promise. We
expect that it will, possibly with some adjustments. Undoubtedly there will
be room for improvement, but we are not now able to foresee the potential
improvements. We do not believe that further theoretical developments will
resolve these issues, most of which are at the interface between theory and
practice. Current development is at a stage where implementation is needed
in order to obtain further information about many aspects of the procedures.
This section lists proposals for evaluating the various important
properties of a CAT, including unidimensionality, reliability, validity,
and equivalence among test forms. Since all of these properties depend
critically on the way in which the test is implemented, proposals are also
made for evaluating the quality of the procedures for determining item
parameters, the procedure for tailoring the test to the individual through
item selection, and the procedure for determining the final test score.

Some important aspects of human factors in the equipment for adaptive
testing are noted. Some special problems are discussed, including speeded
tests, the question of whether omitting is to be allowed, and item bias.

Throughout we have adopted the style of the Standards for Educational and
Psychological Tests published by the American Psychological Association,
the American Educational Research Association, and the National Council for
Measurement in Education (APA, 1974). Recommendations are stated
succinctly, and are rated "essential", "very desirable", or "desirable."
Discussion accompanies each recommendation.

In  this  report,  the  terms  "we",  "us",  and  "the  committee"  refer  to  the 
five  authors,  or  a  substantial  majority  of  the  authors.  The  following 
abbreviations  are  used  throughout: 

ASVAB   Armed Services Vocational Aptitude Battery

CAT     Computerized Adaptive Test

CATICC  Computerized Adaptive Testing Interservice Coordinating Committee

CEPCAT  Committee for an Evaluation Plan for Computerized Adaptive Tests

IRT     Item Response Theory (also called Latent Trait Theory)

IRC     Item Response Curve (also called item characteristic curve, item
        operating characteristic, and item response function)

NPRDC   Navy Personnel Research & Development Center

P&P     Paper and Pencil (conventional group test)
Item Content Specification

In a sense, the specification of the item content is not within the areas
covered in this report, which is focused on technical issues. Still,
content is of fundamental importance. At present, it is important that the
CAT items cover the same ground as the P&P tests with which they are
intended to be interchangeable.

C1. Specifications for item content should be the same for both the CAT
and P&P ASVAB tests. Essential.

Beyond this obvious requirement, there is nothing special about CAT items
that does not apply equally to P&P items on conventional tests. The purpose
of this requirement is to help ensure the comparability of CAT and P&P
forms of the ASVAB, on the assumption that for an initial period, both test
modes will have to be used while CAT is being introduced. When CAT is
established as the primary testing mode, with P&P versions used rarely, if
at all, then this requirement is withdrawn. Indeed, we urge using the new
medium to improve assessments through new and expanded content.

Frequently there are informal guidelines about coverage or other
characteristics of items for a given test. Wherever possible those
guidelines should be made explicit, so that they are clearly understood by
those preparing the items.

Elsewhere it is recommended that items be selected that are highly
discriminating (i.e., have high values of $a_j$). Assuming that such a
criterion is followed, it will be important to examine the selected items
for coverage of content, to be sure that item selection has not disturbed
content specifications.

C2. The content of items selected for the final item pool should match
the content specifications. Essential.

C3. Test items must be compatible with CAT equipment. Essential.

Test items must be designed to be consistent with whatever equipment is to
be used. At present, the main constraint is that each item must fit
entirely on the display screen at one time. This can be a problem for
reading comprehension items, which typically involve a paragraph to be
read, and one or more questions to be answered about the paragraph.


Dimensionality 

Present methods of adaptive testing, and the item response theory (IRT)
on which the methods are based, require that the test be unidimensional.
Each item should measure the same unitary construct, in addition to its
specific and error components. Unidimensionality is always advisable with
tests of ability, but it is more critical for adaptive tests. Thus, a
necessary precursor to tailored testing is the demonstration that the item
pool is actually unidimensional. Such a demonstration can use existing
data for the current ASVAB tests, since it is difficult at present to deal
with item response data from tailored tests for purposes of test analyses.


There are several possible ways to obtain evidence of unidimensionality.
One way would be to show that the IRT model provided an adequate fit to the
item response data. A factor analysis of the item intercorrelations could
also give useful evidence. A variety of other methods have been suggested
or are under development. Unfortunately each method has drawbacks, so it
would be prudent to use more than one method.

Although the theory is based on unidimensional items, empirical results
show that the model is suitable when the items have one dominant dimension.
Items also related to small secondary dimensions will tend to have smaller
$a$ values, but will not distort the system.


D1. The fit of the model should be checked. Very desirable.


It might be thought that the fit of the IRT model to the item response
data would provide the primary indication of the adequacy of the
unidimensional model. After all, the model is unidimensional, so if the
model fits the data, the data can therefore be treated as unidimensional.
However, it appears that the fit of the IRT model is not very sensitive to
lack of unidimensionality (Jones, 1980). The main mechanism for assessing
the fit of the model is to compare empirically-determined item response
curves to the curve determined by the item parameters. Items that are
multidimensional will tend to have smaller values of $a$ (i.e., item
response curves with shallow slopes). The extent to which these items form
meaningful subgroups must be assessed in other ways. This does not mean
that the fit of the model is irrelevant to dimensionality. Fit is
necessary, but it is not sufficient.
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The  main  reason  for  comparing  the  model  and  empirical  Item  response 
curves  Is  to  check  the  accuracy  of  the  Item  parameter  estimates  and  to 
Indicate  item  Idlosyncracles.  This  recommendation  is  repeated  as  a  part  of 
the  evaluation  of  methods  of  obtaining  the  item  parameters*  It  Is  relevant 
here  in  showing  that  the  model  appears  to  fit. 

D2. Highly discriminating items should be selected. Essential.

The main mechanism for ensuring unidimensionality is item selection.
Tailoring works best when items are highly discriminating, which means that
they have high values of a and high correlations with the total test score.
Urry (1981) suggests selecting only items with values of a of at least 0.8
(on a scale on which ability has mean = 0, standard deviation = 1). To relate
this criterion to more familiar parameters, it can be shown that for c =
.25, the biserial correlation of the item and the ability is about .53 when
a = .8. This is relatively high for an item-test biserial correlation.
Thus, requiring a ≥ 0.8, or some similar cut-off, means that items will have
high correlations with ability. Multifaceted items will tend to have lower
values of a. Thus selecting items with high values of a will in itself
tend to ensure unidimensionality. The selection of items with high a-values
will also help to make the system efficient, since fewer items will be
needed for each person.
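
The relation between the discrimination parameter and the item-ability
biserial can be checked numerically. The sketch below is only an
illustration under stated assumptions: a normal-ogive response function
with guessing, P(theta) = c + (1 - c) * Phi(a * (theta - b)), standard
normal ability, and a simple grid integration. The function name
item_biserial and its arguments are hypothetical. With c = 0 the biserial
reduces to a / sqrt(1 + a^2); with guessing it is lower, and the exact
figure depends on the item difficulty and on the form of the model, so the
sketch is not expected to reproduce the .53 quoted above.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()  # standard normal ability distribution (assumed)


def item_biserial(a, b=0.0, c=0.0, n_points=2001, limit=8.0):
    """Biserial correlation between a binary item and N(0, 1) ability.

    Assumes P(theta) = c + (1 - c) * Phi(a * (theta - b)).
    """
    step = 2.0 * limit / (n_points - 1)
    p_correct = 0.0   # overall proportion correct
    cov = 0.0         # E[theta * u]; equals the covariance since E[theta] = 0
    for i in range(n_points):
        theta = -limit + i * step
        weight = _NORMAL.pdf(theta) * step
        p = c + (1.0 - c) * _NORMAL.cdf(a * (theta - b))
        p_correct += weight * p
        cov += weight * theta * p
    # Biserial = covariance / (ordinate of the normal curve at the cut point),
    # because the ability scale has unit standard deviation.
    ordinate = _NORMAL.pdf(_NORMAL.inv_cdf(p_correct))
    return cov / ordinate


if __name__ == "__main__":
    print(round(item_biserial(0.8), 3))          # no guessing
    print(round(item_biserial(0.8, c=0.25), 3))  # four-alternative guessing
```

The comparison of the two printed values shows how guessing lowers the
biserial implied by a given value of a.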

There is a danger that a rigid requirement of a ≥ 0.8 will force
rejection of some good item types. It is well to adjust the requirement to
the level that is practically feasible. Also, there may be other reasons,
such as balanced content, that may indicate including some items with lower
a values.

D3. A factor analysis of the interitem tetrachoric correlations should be
performed. Very desirable.

In principle, unidimensionality can be examined through a factor analysis
of the item intercorrelations. However, determining the item
intercorrelations is a problem. Phi coefficients are usually
unsatisfactory, because their size depends on the item difficulties, so
they tend to yield difficulty factors. Phi coefficients are reasonable
when item difficulties are not too disparate, but ASVAB items vary widely
in difficulty. A better procedure is to use tetrachoric correlations,
although they are not completely satisfactory either. Sometimes the matrix
of tetrachorics is not positive definite, thus violating a requirement of
factor analysis. In general, large samples of test-takers are needed when
using tetrachorics. Also, when the items can be answered correctly by
chance, tetrachorics are distorted. Methods of correcting for the
distortion have been given by Carroll (1946), Urry (1981), and Samejima
(private communication)*. This correction should permit reasonable results
from a factor analysis of tetrachorics. However, Reckase (1981) has found
that if tetrachorics are overcorrected, results are severely disturbed. By
contrast, undercorrection of tetrachorics is relatively safe.

A factor analysis of interitem tetrachorics can be indicative of
unidimensionality. Minor departures from unidimensionality will probably
not be serious. If there is one prominent factor, and if the secondary
factors exhibit either no discernible pattern, or tend to be related to
item difficulty, then unidimensionality is supported. By contrast, if
there are two or three prominent factors, and if these factors can be
rotated (using the oblique or correlated factor model) to show meaningful
distinctions, then unidimensionality is challenged.
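
The dependence of phi coefficients on item difficulty, noted above as the
source of difficulty factors, can be seen in a small worked example. The
sketch below is illustrative only; the function names are hypothetical. It
computes the largest phi that two items can attain given their proportions
correct, which occurs when every examinee who passes the harder item also
passes the easier one.

```python
import math


def phi_coefficient(p_both, p_g, p_h):
    """Phi for two binary items from the joint and marginal proportions correct."""
    return (p_both - p_g * p_h) / math.sqrt(p_g * (1 - p_g) * p_h * (1 - p_h))


def max_phi(p_g, p_h):
    """Largest attainable phi for the given proportions correct.

    The maximum occurs when everyone who passes the harder item also
    passes the easier one, so the joint proportion equals the smaller marginal.
    """
    p_hard, p_easy = min(p_g, p_h), max(p_g, p_h)
    return phi_coefficient(p_hard, p_hard, p_easy)


if __name__ == "__main__":
    print(max_phi(0.5, 0.5))  # equal difficulties: phi can reach 1
    print(max_phi(0.2, 0.8))  # disparate difficulties: phi is capped near 0.25
```

Two items that order examinees perfectly consistently can thus show a phi
of only about .25 when their difficulties are .2 and .8, which is why a
factor analysis of phi coefficients tends to group items by difficulty.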

A variety of similar procedures can be used to assess the factorial
structure of binary items. They are all relatively new, and not much
experience in their use has accrued. Nevertheless, they are viable
alternatives, and are preferable to the above methods. Christoffersson
(1975) and Muthén (1978) have presented methods for the factor analysis of
dichotomous items that are computationally feasible. Another method is
proposed by Bartholomew (1980). Bock and Aitkin (1981) have suggested a
marginal maximum likelihood method, based on multidimensional IRT models
that may include guessing terms. The more general approach of covariance
structure analysis could also be used. Some of these methods provide
chi-square tests of goodness of fit, or of the contribution of successive
factors added to the model.

D4. Local independence should be examined. Desirable.

Unidimensionality and the IRT model imply the property of local
independence. In the IRT model, ability is the only source of association
between items. Thus, for persons with the same ability, the items should
be independent. Tests of the local independence hypothesis are being
developed by Holland (1981), Levine (Note 1), and Stout (Note 4), but are
not ready for practical use. They all rely on indirect assessment, by
deducing certain consequences of local independence, and testing the
occurrence of such consequences. We are particularly interested in Stout's
proposal, and urge trying it when it becomes available.
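
The idea of local independence can be illustrated with a rough descriptive
check, which is not one of the formal tests cited above. The sketch below
assumes a two-parameter logistic model and, for simplicity, uses the
simulated true abilities to form strata; in practice only ability estimates
would be available. All function names are hypothetical. Two items that
depend only on ability covary in the whole sample, but their covariance
within narrow ability strata should be near zero.

```python
import math
import random


def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def pooled_conditional_covariance(u_g, u_h, ability, n_strata=20):
    """Size-weighted average of within-stratum covariances of two items."""
    order = sorted(range(len(ability)), key=lambda i: ability[i])
    size = len(order) // n_strata
    total, count = 0.0, 0
    for s in range(n_strata):
        start = s * size
        stop = (s + 1) * size if s < n_strata - 1 else len(order)
        idx = order[start:stop]
        total += len(idx) * covariance([u_g[i] for i in idx],
                                       [u_h[i] for i in idx])
        count += len(idx)
    return total / count


def simulate_response(theta, a, b, rng):
    """One simulated response under a 2PL logistic model (an assumption)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return 1 if rng.random() < p else 0


rng = random.Random(0)
theta = [rng.gauss(0.0, 1.0) for _ in range(20000)]
item_g = [simulate_response(t, 1.5, 0.0, rng) for t in theta]
item_h = [simulate_response(t, 1.5, 0.0, rng) for t in theta]

overall = covariance(item_g, item_h)
conditional = pooled_conditional_covariance(item_g, item_h, theta)

if __name__ == "__main__":
    print(round(overall, 3), round(conditional, 3))
```

A conditional covariance that remains clearly positive for some item pairs
(for example, items sharing a paragraph) would suggest a violation of local
independence for those pairs.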

D5. Subtests should be formed when tests are not unidimensional.
Desirable.

Many tests are not perfectly unidimensional. The items tend to cluster by
content area. Achievement tests, and tests of general knowledge, are
especially prone to such clustering. Such tests can generally be viewed as
having a single dominant factor, plus several small group factors, but they
can also be viewed as having several highly correlated group factors. When
the single dominant factor is sufficiently dominant, or equivalently, when
the group factors are sufficiently highly correlated, then the test may be
treated as unidimensional. Otherwise, it would be better to treat each
cluster, or group factor, as a subtest.

*Samejima's result is as follows. In these formulas, P and p indicate the
observed and the modified proportions, respectively, g and ḡ, or h and h̄,
denote the correct and incorrect answers to item g or h, and c_g or c_h
is the guessing parameter of item g or h, which is unity divided by the
number of alternatives.

The precise criterion for deciding which to do is not easy to specify.
A single factor that accounts for 70% of the total common variance is
probably strong enough; one that accounts for less than 50% probably
signals the use of subtests. If a correlated common factor model is used,
we recommend rotating the results to a single general factor plus an
orthogonal residual group space, so that these criteria can apply.
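
The 70% and 50% guidelines can be applied mechanically once a solution
with one general factor and orthogonal group factors is in hand. The sketch
below is a minimal illustration; the loadings are invented, and the
function names and the handling of the region between the two thresholds
are assumptions rather than part of the recommendation.

```python
def general_factor_share(loadings):
    """Share of total common variance due to the general factor.

    `loadings` is a list of rows, one per item; column 0 is the general
    factor and the remaining columns are orthogonal group factors.
    """
    general = sum(row[0] ** 2 for row in loadings)
    common = sum(value ** 2 for row in loadings for value in row)
    return general / common


def dimensionality_decision(share, strong=0.70, weak=0.50):
    """Apply the rough 70% / 50% guidelines stated in the text."""
    if share >= strong:
        return "treat as unidimensional"
    if share < weak:
        return "form subtests"
    return "intermediate: needs judgment"


# Hypothetical loadings for four items: one general and two group factors.
example = [
    [0.7, 0.3, 0.0],
    [0.6, 0.4, 0.0],
    [0.7, 0.0, 0.3],
    [0.6, 0.0, 0.4],
]

if __name__ == "__main__":
    share = general_factor_share(example)
    print(round(share, 3), dimensionality_decision(share))
```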

Another way of assessing unidimensionality applies to tests in which
the items can be sorted into clusters on the basis of item type or item
content. In these cases, for a large heterogeneous sample of test takers,
separate scores should be obtained for each subtest, and these subscores
intercorrelated. Estimates of the reliability of each subscore should also
be made, and the intercorrelations should be corrected for unreliability
("disattenuated"). If the corrected correlations are sufficiently high
(say about .9), the test can be considered unitary.
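
The correction for unreliability is the classical one: the observed
correlation is divided by the square root of the product of the two
reliabilities. A minimal sketch follows; the function name and the numbers
are illustrative assumptions.

```python
import math


def disattenuated_correlation(r_xy, reliability_x, reliability_y):
    """Correlation between subscores corrected for unreliability."""
    return r_xy / math.sqrt(reliability_x * reliability_y)


if __name__ == "__main__":
    # Hypothetical subscores: observed r = .72, each with reliability .80.
    corrected = disattenuated_correlation(0.72, 0.80, 0.80)
    print(round(corrected, 2))  # close to the .9 guideline in the text
```

Because reliabilities are themselves estimates, a corrected value can
exceed 1.0 in small samples; such values should be read as "very high"
rather than taken literally.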

Finally, when separate tailoring is done, there are two scores which
must at some point be combined. Also, if a variable stopping criterion is
used, there are two stopping criteria. Note that we are not proposing to
replace one test score by two or more subtest scores when there is some
lack of unidimensionality or some separate content areas. No doubt the
subtests will be quite highly correlated, and each will be less reliable
than a test score should be. Thus the subscores will generally be
unsuitable for separate use either in prediction or in counselling and
should not be reported separately. Rather, the subscores should be combined
as suggested. Weights for the combination of subscores into one test score
could, for example, be chosen so that the resulting score has maximum
correlation with the paper-and-pencil version of the test.

We note that comparability with paper-and-pencil tests may suggest
intermixing the items on the subtests. This can be done, in principle, but
the equipment must be able to keep track of two or more simultaneous
adaptive tests. This is not difficult on general-purpose microcomputers,
but it is a requirement to keep in mind when obtaining the equipment. This
would provide the possibility of multidimensional adaptive testing in the
future, which could take advantage of correlations between the underlying
abilities while obtaining an estimated score on each.

D6. Tests should be balanced for content and/or item type. Essential.

Some tests have heterogeneous content, or use two or more different item
types, or both. The ASVAB includes two tests with heterogeneous content.
Auto & Shop Information contains items about autos and items about shop.
General science items can be sorted into natural science (physics and
chemistry), biology, and health and nutrition. These tests will often not
be strictly unidimensional, but may fit the criteria of the previous
section. If the test is treated as two or more subtests, then each test
taker will necessarily face items of each type, or content. But if the
test is administered as a unit, then items should be selected in a balanced
way so that, as nearly as possible, each person gets the same number of
items from each content area. This not only makes the test comparable to
the paper-and-pencil version, but it balances the items in case of any
biases for subpopulations. For example, Bock & Mislevy (1981) found that,
on the General Sciences Test, to a small extent, males scored relatively
higher than females on natural science items, and females scored relatively
higher than males on items relating to health and nutrition. Proportional
representation of all areas will tend to balance these differences on the
test as a whole.

We note that the best way to decide whether or not to balance a
particular test for content is by empirical study. Balance may or may not
be important. Are there differences between identifiable groups on the
different item clusters? Does any such difference imply a relative
advantage if content is not balanced on the adaptive test?

When the test is tailored and administered as a unit, balancing the
items for content and type requires a hard choice. If the items are
selected alternately from the various subgroups, the item subsets will also
be nearly balanced for difficulty, but the candidate must keep switching
contexts, which would be especially bad with different item types. If, on
the other hand, several items are selected from one subset, then several
from another, and so on, the candidate need not switch content so often,
but content type is somewhat confounded with item difficulty. This is not
a severe problem if the test is very nearly unidimensional. We favor this
second procedure. Whenever the interaction of difficulty with content
could be a problem (because some persons are better on one content, others
on another), then separate subtests are preferable. We note again, however,
that this is opinion. Empirical evidence is needed to determine the
relative advantages and disadvantages of each procedure. There is
obviously a limit to how finely the content should be subdivided. Each
item is to a large extent specific. There will always be some persons who
happen to know more about one item than another. So long as this is either
due to the dimension being tested, or is unrelated to that dimension, the
specific aspects should tend to average out over a number of items.

Careful experimental study of the problem is needed.
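
The second procedure described above (several items from one content area,
then several from another) can be sketched as a constraint on ordinary
maximum-information item selection. The code below is a hypothetical
illustration, not a description of any proposed CAT system: the Item
fields, the block size of three, the two-parameter logistic information
function, and the fallback when an area is exhausted are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    area: str   # content area, e.g. "auto" or "shop"
    a: float    # discrimination
    b: float    # difficulty


def information(item, theta):
    """2PL logistic item information, a^2 * P * Q (scaling D = 1 assumed)."""
    p = 1.0 / (1.0 + math.exp(-item.a * (theta - item.b)))
    return item.a ** 2 * p * (1.0 - p)


def area_for_position(position, areas, block_size):
    """Content area for a 0-based test position: blocks taken in rotation."""
    return areas[(position // block_size) % len(areas)]


def select_next(pool, administered, theta, position, areas, block_size=3):
    """Most informative unused item from the area scheduled for this position."""
    area = area_for_position(position, areas, block_size)
    candidates = [it for it in pool
                  if it.area == area and it.item_id not in administered]
    if not candidates:  # area exhausted: fall back to any unused item
        candidates = [it for it in pool if it.item_id not in administered]
    return max(candidates, key=lambda it: information(it, theta))


pool = [
    Item("a1", "auto", 1.0, 0.0), Item("a2", "auto", 1.2, 1.5),
    Item("s1", "shop", 0.9, 0.1), Item("s2", "shop", 1.4, -2.0),
]

if __name__ == "__main__":
    first = select_next(pool, set(), 0.0, 0, ["auto", "shop"])
    fourth = select_next(pool, {"a1", "a2"}, 0.0, 3, ["auto", "shop"])
    print(first.item_id, fourth.item_id)
```

Setting block_size to 1 gives the strictly alternating procedure, so the
two schemes discussed in the text can be compared in the same simulation.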

One final problem arises in connection with the paragraph
comprehension test. In the current ASVAB forms there are several items per
paragraph. These items are invariably more highly intercorrelated than are
items from different paragraphs. This violates the principle of local
independence of items, which is central to IRT, and hence to CAT. Thus in
tailored testing, ideal items would have only one question per paragraph.
Efficiency would then require short paragraphs. We note elsewhere that the
capacity of the display screen limits the length of the paragraph
comprehension questions. However, shorter items, with one question per
item, may measure a somewhat different skill. Empirical evidence on the
comparability of item types is important. If empirical research shows that
there is a noticeable difference, the P&P version of the ASVAB should be
changed to match the CAT version.

Finally, if multiple-item paragraphs are used in the CAT version, some
form of multiple response model should be used that does not assume
conditional independence among responses to the same paragraph (e.g.,
Samejima, 1969). In this case, tailoring will have to operate with respect
to paragraphs rather than items, but the greater precision of these types
of items may allow the number of paragraphs presented to be less than the
number of items presented in those tests where all items are independent.


Reliability  and  Measurement  Error 


In classical test theory, reliability is defined as the ratio of the
true score variance to the observed score variance in the population of
persons from which the examinees are assumed to be randomly sampled. This
quantity can be expressed as the intraclass correlation,

$$\rho = \frac{\sigma_x^2 - \sigma_e^2}{\sigma_x^2},$$

where $\sigma_x^2$ is the observed score variance and $\sigma_e^2$ is the
measurement error variance. The correlation, $\rho$, is estimated directly
by the customary indices of reliability, such as parallel-form or
test-retest reliability, and indirectly, by split-half reliability;
Cronbach's alpha provides a lower bound to $\rho$.

By a simple algebraic manipulation, the measurement error variance can
be expressed as

$$\sigma_e^2 = \sigma_x^2 (1 - \rho);$$

the standard error of measurement is then

$$\sigma_e = \sigma_x \sqrt{1 - \rho}.$$

Although the reliability is a convenient unitless number between 0 and 1,
the standard error of measurement is more useful in score interpretation.
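
As a worked example of the classical relations, in which the error variance
is the observed variance times one minus the reliability, the sketch below
computes the standard error of measurement. The function names and the
numerical values are illustrative assumptions.

```python
import math


def error_variance(observed_variance, reliability):
    """Classical measurement error variance: observed variance * (1 - rho)."""
    return observed_variance * (1.0 - reliability)


def standard_error_of_measurement(observed_sd, reliability):
    """Classical standard error of measurement: sd * sqrt(1 - rho)."""
    return observed_sd * math.sqrt(1.0 - reliability)


if __name__ == "__main__":
    # Hypothetical score scale with standard deviation 10 and reliability .91.
    print(round(standard_error_of_measurement(10.0, 0.91), 2))  # about 3 points
```

A reliability of .91 sounds high, yet on this scale an observed score is
still uncertain by roughly three points, which is the sense in which the
standard error is more useful for interpreting an individual score.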

The formulas above are used almost universally in test practice, but
they make use of the generally false assumption that the error variance is
the same for all scores, rather than being dependent on ability. It is
widely recognized that the measurement error is not constant for
conventional tests, being larger at the extremes of the ability
distribution and smaller near the mean. Since the classical formulation
above uses a single average value for the error variance, $\sigma_e^2$, the
conventional reliability coefficient is at best a crude description of the
true state of affairs.

As Samejima (1977a) has pointed out, this definition of reliability
has little relevance for measurement based on item response theory, where
the error variance is expressed as a function of ability. In item response
theory, the estimate of error variance is expressed as the variance of
estimated ability, $\hat{\theta}$, for a fixed value of ability, $\theta$.
The estimate of error variance will depend on the method used to estimate
$\theta$. Using classical statistical theory, with maximum likelihood
estimation of $\theta$, the variance of measurement error is given by the
reciprocal of the information function, as noted in the introduction. In
Bayesian theory, the error variance is also readily computed.
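
The reciprocal relation between information and error variance can be made
concrete. The sketch below assumes the three-parameter logistic model with
scaling constant D = 1.7 and the usual expression for its item information;
the model choice, function names, and example parameters are assumptions
made for illustration.

```python
import math

D = 1.7  # logistic scaling constant (assumed)


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """3PL item information: D^2 a^2 (Q / P) ((P - c) / (1 - c))^2."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def standard_error(theta, items):
    """Asymptotic standard error of the ML ability estimate at theta."""
    info = sum(item_information(theta, a, b, c) for (a, b, c) in items)
    return 1.0 / math.sqrt(info)


if __name__ == "__main__":
    # Hypothetical 20-item test: a = 1.0, c = .25, difficulties from -2 to 2.
    test = [(1.0, -2.0 + 4.0 * k / 19, 0.25) for k in range(20)]
    for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
        print(theta, round(standard_error(theta, test), 3))
```

Running the loop for a fixed set of items shows the standard error changing
with ability, which is the point made in the text.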


In adaptive testing, the measurement error variance depends on the
stopping rule. One stopping rule is based directly on the error variance
or its reciprocal, the information function: all examinees are tested to
the same value of the error variance, or information, over as wide a range
of ability as practical. In that case, the estimated standard error of
measurement is constant. If the item pool is not large enough to support a
uniform information criterion, or if some other stopping rule is used, such
as a fixed number of items, then the error variance will still depend on
the ability level, $\theta$.
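
The interaction between the stopping rule and the item pool can be sketched
in a simplified form. The code below is a deliberately idealized
illustration: it selects items at the examinee's true ability rather than
at a running estimate, uses a two-parameter logistic information function,
and draws on an invented pool. The function names, pool, target standard
error, and item limit are all assumptions.

```python
import math


def information(theta, a, b):
    """2PL logistic item information, a^2 * P * Q (scaling D = 1 assumed)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def administer_to_target(theta, pool, target_se, max_items):
    """Give the most informative remaining items until SE <= target_se.

    Returns (number of items used, final standard error). Testing also
    stops when max_items is reached or the pool is exhausted.
    """
    remaining = list(pool)
    total_info = 0.0
    used = 0
    while remaining and used < max_items:
        best = max(remaining, key=lambda item: information(theta, *item))
        remaining.remove(best)
        total_info += information(theta, *best)
        used += 1
        if 1.0 / math.sqrt(total_info) <= target_se:
            break
    se = 1.0 / math.sqrt(total_info) if total_info > 0 else math.inf
    return used, se


# Hypothetical pool: three items (a = 1.5) at each difficulty from -2 to 2.
pool = [(1.5, -2.0 + 0.25 * k) for k in range(17) for _ in range(3)]

if __name__ == "__main__":
    for theta in (0.0, 3.0):
        print(theta, administer_to_target(theta, pool, 0.5, 30))
```

In the middle of the ability range the target standard error is reached
with a handful of items; at theta = 3 the pool contains too few difficult
items, the item limit is reached first, and the standard error stays above
the target.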

One further practical problem in the use of IRT is the score scale.
The natural scale for IRT is the ability scale, $\theta$. This scale is
non-linearly related to the conventional score scale, which is used on the
paper-and-pencil ASVAB. Since, at least at present, it may be desirable to
transform the CAT scores to expected number-right scores, the standard
error of measurement will again be a function of the score value. The
expected number-right score is a strong monotonic function of $\theta$, and
is given by

$$\xi(\theta) = \sum_{j=1}^{n} P_j(\theta),$$

where $P_j(\theta)$ is the item response function of item $j$ and the sum
runs over the $n$ items of the reference test.

This discussion leads to the following recommendation.

E1. The standard error of measurement of each test score should be
reported as a function of the test score, in the metric of the reported
score. Either a graphical or tabular report form, or both, can be used.
Essential.
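
One way such a table might be produced is sketched below. It assumes the
three-parameter logistic model with D = 1.7, takes the expected
number-right score as the sum of the item response functions of a reference
test, and converts a standard error on the ability scale to the
number-right metric with a first-order (delta-method) approximation. The
approximation, the function names, and the example test are assumptions,
not procedures specified in this report.

```python
import math

D = 1.7  # logistic scaling constant (assumed)


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def expected_number_right(theta, items):
    """Expected number-right score on the reference test at ability theta."""
    return sum(p_correct(theta, a, b, c) for (a, b, c) in items)


def score_slope(theta, items):
    """Derivative of the expected number-right score with respect to theta."""
    total = 0.0
    for (a, b, c) in items:
        p = p_correct(theta, a, b, c)
        total += D * a * (p - c) * (1.0 - p) / (1.0 - c)
    return total


def se_number_right(theta, items, se_theta):
    """First-order approximation: SE(score) = slope * SE(theta)."""
    return score_slope(theta, items) * se_theta


if __name__ == "__main__":
    # Hypothetical 25-item reference test with guessing parameter .25.
    reference = [(1.0, -2.0 + 4.0 * k / 24, 0.25) for k in range(25)]
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        score = expected_number_right(theta, reference)
        se = se_number_right(theta, reference, se_theta=0.30)
        print(round(score, 1), round(se, 2))
```

Even when the standard error is constant on the ability scale, as assumed
in the example, the standard error in the number-right metric varies with
the score, because the transformation is non-linear.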

E2. The standard error of measurement of each test should also be
reported in the ability metric, as a function of the test score, unless the
standard error is constant. Desirable.

This recommendation is for the convenience of further psychometric analysis
of the test. This information will be useful only in connection with the
actual item parameters, and should be kept separate from the report in E1.

Reliability. Custom suggests that a unitless index of reliability
also be provided, although such an index is somewhat contrived. Many
psychometricians feel that devising a reliability coefficient for an
adaptive test is inappropriate and misguided. Nevertheless, if one is
determined to have a reliability coefficient in item response theory, there
are two possibilities. One approach is to define a conditional
reliability, i.e.,

$$\rho(\theta) = \frac{\sigma_x^2 - \sigma_e^2(\theta)}{\sigma_x^2},$$

with $\sigma_e^2(\theta)$ the error variance at ability $\theta$ and
$\sigma_x^2$ the observed score variance,
which could be graphed or specified at selected values of $\theta$. This
would be the reliability if everyone were measured with the same precision
as those persons with ability $\theta$. This function is somewhat like the
information function except that it has the convenient property of being
unitless.

The other possibility is to define an average or marginal measurement
error. In a population with ability distribution $g(\theta)$,

$$\sigma_{em}^2 = \frac{\int_{-\infty}^{+\infty} \sigma_e^2(\theta)\, g(\theta)\, d\theta}{\int_{-\infty}^{+\infty} g(\theta)\, d\theta},$$

where, if ability is normally distributed, these integrals can be evaluated
by Gaussian quadrature, as discussed by Bock & Lieberman (1970). Then
marginal reliability can be defined as

$$\rho_m = \frac{\sigma_x^2 - \sigma_{em}^2}{\sigma_x^2},$$

where $\sigma_x^2$ is the observed score variance.

E3. Reliability should be reported by the marginal reliability index. If
testing is to a fixed error criterion, this is equivalent to classical
reliability. Very desirable.
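
A minimal numerical sketch of the marginal index follows. It assumes a
standard normal ability distribution and, for simplicity, replaces Gaussian
quadrature with a plain equally spaced grid rule; the function names and
the example error-variance functions are hypothetical.

```python
from statistics import NormalDist

_NORMAL = NormalDist()  # assumed ability distribution g(theta)


def marginal_error_variance(error_variance_fn, lo=-6.0, hi=6.0, n_points=1201):
    """Average of the conditional error variance over the ability distribution.

    A simple grid rule stands in for Gaussian quadrature here.
    """
    step = (hi - lo) / (n_points - 1)
    numerator = 0.0
    denominator = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = _NORMAL.pdf(theta) * step
        numerator += error_variance_fn(theta) * weight
        denominator += weight
    return numerator / denominator


def marginal_reliability(score_variance, error_variance_fn):
    """(score variance - marginal error variance) / score variance."""
    error = marginal_error_variance(error_variance_fn)
    return (score_variance - error) / score_variance


if __name__ == "__main__":
    # Fixed error criterion: the index equals the classical reliability.
    print(round(marginal_reliability(1.1, lambda theta: 0.1), 3))
    # Error variance that grows toward the extremes of the ability range.
    print(round(marginal_reliability(1.07, lambda t: 0.05 + 0.02 * t * t), 3))
```

The first call illustrates the statement in E3: with a constant error
variance the marginal index reduces to the classical coefficient.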

E4. Conditional reliabilities should be reported at selected points on
the ability scale. Very desirable.

E5. The precision of P&P and CAT versions of the ASVAB tests should be
compared. Essential.

For comparative purposes it will be necessary to show the precision of the
current P&P (paper-and-pencil) versions of the ASVAB. We recommend finding
item parameters for one current form of each test to obtain the test
information function and thus the measurement error variance. It is
recognized that the CAT will be more efficient because of its adaptive
character and also because the stringent criteria for item selection will
result in more discriminating items on the adaptive test. Still, the
important practical question is the overall gain. Such a demonstration
will not be difficult. IRT item parameters have been obtained for at least
one form of the ASVAB, both by Ree (private communication) and by Bock &
Mislevy (1981). Of course the parameters must be obtained by the same
statistical procedure that is used for the CAT items, so additional work
may still be needed.

Eventually it will be desirable to compare the precision of each CAT
with the precision of a hypothetical P&P test formed from the CAT item
pool, picking items that match the actual P&P test in difficulty but
match the CAT pool in discrimination. Such a comparison would indicate
how much of CAT's efficiency is due to better items and how much to
tailoring.


Empirical  reliability. 

The above definitions refer to error due to sampling of items from an
indefinitely large pool. They do not include variability due to short-run
random variation of the trait being measured or to situational variance in
the testing conditions. These sources of error can only be assessed
empirically. It would be desirable to estimate the extent of this type of
variation by readministering the adaptive test on successive days or weeks,
with the condition that items presented to the same subject are sampled
without replacement. Pearson product-moment correlations of the paired
measurements would serve to estimate the empirical reliability. Because
the items are not repeated, the reliability determined in this way is
equivalent to classical alternate-form reliability. Because of the way the
items are selected, this reliability might be called stratified,
randomly-parallel form reliability. Because the scores are obtained on
different days, test fatigue would be avoided; because the days are close
in time, this reliability coefficient would indicate the short-term
stability of the scores.

E5. Alternate-form reliability should be determined empirically for each
test in the battery. Essential.

It should be noted that most decisions are based not on individual
test scores but on various composites. Initial entry, for example, is
based on the AFQT composite. Although there are too many composites to
enable calculating reliabilities for each composite, reliabilities can be
computed for the most widely used composites, and the data should be
available for computing reliabilities of arbitrary composites. This
requires the reliabilities of the individual tests, and the
intercorrelation matrix of the tests.
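
The computation for an arbitrary composite can be sketched as follows. The
sketch uses the standard result for a weighted sum of tests under the
assumption that measurement errors are uncorrelated across tests; that
assumption, the function names, and the example values are illustrative.

```python
def composite_variance(weights, covariance):
    """Variance of a weighted sum of tests, given their covariance matrix."""
    n = len(weights)
    return sum(weights[i] * weights[j] * covariance[i][j]
               for i in range(n) for j in range(n))


def composite_reliability(weights, covariance, reliabilities):
    """Reliability of a weighted composite.

    Assumes measurement errors are uncorrelated across tests, so the
    composite error variance is the weighted sum of the test error variances.
    """
    error = sum(w * w * covariance[i][i] * (1.0 - r)
                for i, (w, r) in enumerate(zip(weights, reliabilities)))
    return 1.0 - error / composite_variance(weights, covariance)


if __name__ == "__main__":
    # Two hypothetical standardized tests correlating .6, each with
    # reliability .80, combined with unit weights.
    correlations = [[1.0, 0.6], [0.6, 1.0]]
    print(round(composite_reliability([1.0, 1.0], correlations, [0.8, 0.8]), 3))
```

With standardized tests the covariance matrix is simply the
intercorrelation matrix, which is why only the test reliabilities and
intercorrelations need to be reported.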

E6. Reliability of widely used composite scores should be reported.
Highly desirable.

E7. Test intercorrelations should be reported. Highly desirable.


Validity  and  differential  prediction. 

Before switching from paper-and-pencil (P&P) versions of the ASVAB to
a computerized adaptive test (CAT), it is important to have evidence that
the CAT is at least as valid as the P&P version of the battery. During a
transition stage when P&P and CAT are both in operational use, it would
also be important to have evidence that scores on the two test forms have
the same predictive meaning. Included in the latter category would be
investigations of possible differences in CAT prediction equations for key
subpopulations (e.g., differential prediction as a function of race or
sex).

Comparisons of variance-covariance matrices and covariance structures
are needed. A comparison of interrelationships among the tests for the two
forms would provide an initial check for possible differences in the
validities of the CAT and P&P tests. Correlations among the tests may be
altered due to differences in the precision of measurement at different
ability levels, with the CAT version expected to yield better measurement
at the extremes. When based on different samples, the correlations would
also be expected to vary as a function of group heterogeneity. For these
reasons, comparisons of variance-covariance matrices and of covariance
structures will be more informative than comparisons of correlation
matrices.

The predictive validity of the P&P version of the ASVAB has been
well documented (Department of Defense, 1980; Fischl et al., 1978). If the
CAT versions of the tests are highly related to their respective P&P
counterparts, and if the covariance structures are similar, similar
predictive validity can be inferred. There are various opinions about
potential differences in validity for adaptive tests in general. The CAT
tests may possibly be more nearly unidimensional than the conventional P&P
tests, but the purity may be seen as clarity, implying improved validity,
or as sterility, implying poorer validity. Kingsbury & Weiss (1981) and
Sympson & Weiss (Note 6) claimed to be optimists, but in fact found very
little difference between the validity of the two modes.

V1. The similarity of variance-covariance matrices should be assessed.
Essential.

The most straightforward comparison is a test of the hypothesis of
equality of the variance-covariance matrices for the CAT and P&P versions,
$\Sigma_c = \Sigma_p$. Procedures for testing this hypothesis are well
known (e.g., Box, 1950). The main requirements for the purposes of the CAT
vs. P&P comparison are that (1) the two versions of the tests are
administered to random samples from the same population, and (2) the
number of examinees taking each version is relatively large (say, 200 or
more).


Assuming that the CAT will be experimental, it will be subject only to
incidental selection. Comparable P&P data would then require a special
administration (retesting) with another P&P form, so that P&P and CAT
scores are both subject to incidental selection, and so that both scores
are obtained in retesting, with as nearly similar motivational conditions
as can be managed, and with order of testing counterbalanced. Other
experimental designs are, of course, possible.

Although the comparison of variance-covariance matrices is important,
it will not provide a direct indication of the source of any differences
that are found. For that purpose, the comparison of covariance structures
outlined in the following section should be more useful.

V2. The covariance structures of the two versions should be compared.
Very desirable.

It is recommended that the factor structures of the CAT and P&P versions be
compared. Using subscripts p and c to designate the P&P and CAT results
respectively, the general form of the two n x n variance-covariance
matrices would be:

$$\Sigma_c = \Lambda_c \Phi_c \Lambda_c' + \Psi_c$$

and

$$\Sigma_p = \Lambda_p \Phi_p \Lambda_p' + \Psi_p.$$

After some exploratory analyses of a large sample of P&P
results, a hypothesized pattern of zeros and free parameters in $\Lambda_p$
would be determined for m < n factors.
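
To make the common factor model concrete, the sketch below computes the
covariance matrix implied by a loading matrix, a factor covariance matrix,
and a diagonal matrix of unique variances, that is, loadings times factor
covariances times the transposed loadings, plus the unique variances. The
three-test, two-factor values are invented for illustration, and the helper
names are hypothetical; fitting such a model to data would require a
covariance structure program such as the one cited later in this section.

```python
def matmul(left, right):
    """Product of two matrices stored as lists of rows."""
    return [[sum(left[i][k] * right[k][j] for k in range(len(right)))
             for j in range(len(right[0]))]
            for i in range(len(left))]


def transpose(matrix):
    """Transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def implied_covariance(loadings, factor_cov, unique_var):
    """Model-implied covariance: Lambda * Phi * Lambda' + diag(unique_var)."""
    common = matmul(matmul(loadings, factor_cov), transpose(loadings))
    n = len(loadings)
    return [[common[i][j] + (unique_var[i] if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]


# Hypothetical pattern: tests 1 and 2 load on factor 1, test 3 on factor 2.
loadings = [[0.8, 0.0], [0.7, 0.0], [0.0, 0.9]]
factor_cov = [[1.0, 0.5], [0.5, 1.0]]
unique_var = [0.36, 0.51, 0.19]

if __name__ == "__main__":
    for row in implied_covariance(loadings, factor_cov, unique_var):
        print([round(value, 3) for value in row])
```

The zeros in the loading matrix play the role of the fixed parameters in a
hypothesized pattern; the nonzero entries correspond to free parameters.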

In the initial data collection, it may be necessary to limit the
computerized testing to the seven of the ten ASVAB tests that do not
involve graphics. Those seven tests might be hypothesized to have a
pattern as shown below with X's indicating free parameters and zeros fixed
parameters.

Initial comparison of factor patterns and structure will be limited to
seven tests. When graphical capabilities are added, the analyses
illustrated above should be repeated with the complete battery of ten
tests. For the ten-test battery, we expect the following 4-factor pattern.
Here we separate large loadings X, medium loadings x, and zero loadings, 0.
In a confirmatory analysis, all x's would be free parameters.


                                  Factor
                              1   2   3   4

General Sciences              0   X   0   X
Arithmetic Reasoning          X   0   0   0
Word Knowledge                0   X   0   0
Paragraph Comprehension       0   X   0   0
Numerical Operations          0   0   X   0
Coding Speed                  0   0   X   0
Auto & Shop Information       0   0   0   X
Mathematics Knowledge         X   0   0   0
Mechanical Comprehension      X   0   0   X
Electronic Information        0   x   0   X


1. Equal factor loadings: The first constraint to be imposed is that
$\Lambda_c = \Lambda_p$. All variances and covariances in $\Phi_c$ and
$\Phi_p$ would be free and not constrained to be equal. The $\Psi$ matrices
would also be unconstrained diagonal matrices.

2. More constrained models: Additional constraints that $\Psi_c = \Psi_p$
and/or that $\Phi_c = \Phi_p$ could also be added. It seems likely, however,
that different matrices would be required.
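
The nested hypotheses above can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical 4-test, 2-factor pattern; every loading, factor covariance, and unique variance is an invented value, not an estimate from ASVAB data. It only evaluates the model-implied matrices under the equal-loadings constraint and does not fit the model, which would require a program such as LISREL.

```python
import numpy as np

def implied_covariance(lam, phi, psi_diag):
    """Return Sigma = Lambda Phi Lambda' + Psi for one test version."""
    lam = np.asarray(lam, dtype=float)
    phi = np.asarray(phi, dtype=float)
    return lam @ phi @ lam.T + np.diag(np.asarray(psi_diag, dtype=float))

# Hypothetical pattern: zeros are fixed, nonzero entries are free parameters.
# Constraint 1 (equal loadings): the same Lambda is used for CAT and P&P.
lam_common = np.array([[0.8, 0.0],
                       [0.7, 0.0],
                       [0.0, 0.9],
                       [0.0, 0.6]])

# Phi and Psi are left free, so they may differ between the versions.
phi_c = np.array([[1.0, 0.5], [0.5, 1.0]])
phi_p = np.array([[1.0, 0.4], [0.4, 1.0]])
psi_c = [0.36, 0.51, 0.19, 0.64]
psi_p = [0.40, 0.55, 0.25, 0.70]

sigma_c = implied_covariance(lam_common, phi_c, psi_c)
sigma_p = implied_covariance(lam_common, phi_p, psi_p)

# Largest difference between the two implied covariance matrices.
max_abs_difference = np.abs(sigma_c - sigma_p).max()
```

Adding the further constraints of item 2 would amount to passing the same `phi` and `psi_diag` for both versions, in which case the two implied matrices coincide.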

It is assumed that Form 8, 9, or 10 will be used for the P&P tests.
Comparisons will also be needed between CAT and Form 11, 12, or 13 before
the CAT is made operational. This is important because, if the CAT becomes
operational, it will be used along with Forms 11, 12, and 13. Thus, the
above comparisons should be repeated using Form 11, 12, or 13 for the P&P
version. Alternatively, the comparisons of all three versions could be
made simultaneously by administering Form 8, 9, or 10 to one-third of the
sample, Form 11, 12, or 13 to one-third of the sample, and the CAT to the
remaining third.

As in the simple comparison of the covariance matrices, it is
important that the results be based on sizeable random samples from the
same population. Several available computer programs are capable of
performing the above analyses. One of the better known programs is LISREL
V (Joreskog & Sorbom, 1978). In addition to providing chi-square tests of
the hypotheses suggested above, the program can provide standard errors of
the parameter estimates and residual differences between the sample
variances and covariances and those estimated by the model. Both the
standard errors and the residuals should be reported. They will be useful
for judging the practical importance of any statistically significant
differences that are obtained.

V3.  The  CAT  battery  should  be  validated  using  external  criterion 
measures.  For  comparison,  the  P&P  battery  should  be  validated  using  the 
same  criterion  measures.  Essential. 

The analyses of the covariance structures as outlined above will provide a
good test of the extent to which the CAT and P&P versions are measuring the
same abilities. It will nonetheless be important to compare directly the
prediction equations of the CAT version with those developed for the P&P
version for a few important criterion measures. If nothing else, it would
be useful for purposes of satisfying skeptics. Such comparisons will be
essential, however, if the two versions of the tests are found to have
different covariance structures. The latter outcome also seems quite
likely, since the CAT will have nearly equal precision across the range of
test scores, whereas the P&P test is much less accurate at the extremes
than in the middle of the ability distribution.

For several reasons, comparisons of CAT and P&P correlations with
criterion measures may not be satisfactory. As was just indicated, the
precision or relative efficiency of the two versions is apt to differ as a
function of ability level. Also, the samples for which criterion data
could be obtained will generally be subject to explicit selection on the
P&P tests but only incidental selection on CAT. Thus, the P&P correlations
would be expected to be affected more by selection effects than the CAT
correlations would be.

The comparisons of primary interest can all be classified under the
heading of differential prediction. Comparisons of regression systems,
including error variances, slopes, and intercepts, are all relevant.
Comparisons of two types of regression equations should be made: (1)
regression of a criterion measure on subtest scores and (2) regression of a
criterion measure on test composite scores. To the extent that it is
feasible, it would also be desirable to compare the conditional variances
on the criterion measure as a function of test score. It might be expected
that the conditional variances would be smaller for extreme CAT scores than
for the corresponding scores on the P&P, whereas the conditional variances
would be more nearly equal in the middle. The possibility of nonlinearity
would also be worth investigating to the extent that this is feasible.

Although differences in error variances would be a concern, the more
serious concern would be with differences in slopes and/or intercepts. The
latter types of differences would imply that systematic errors of
prediction would result from using CAT scores in place of P&P scores. For
schools that have minimum entry requirements, the predicted criterion
scores for test scores near the minimum deserve special attention. If the
predicted value on the criterion is significantly higher (lower) for the
CAT than for the P&P, then individuals who take the CAT would be given an
unfair disadvantage (advantage). The Johnson-Neyman (1936) procedure could
be used to determine if the minimum score fell in a region of significant
differences. If there are significant differences in predicted scores
associated with test scores at the cutoff, then different entry
requirements would be needed for the two test versions. Such a finding
would also suggest that a more comprehensive series of differential
prediction studies would need to be undertaken to determine the
generalizability of differences in prediction for other training areas.
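
The comparison of predicted criterion scores at a cutoff can be sketched as follows. This is an illustration under stated assumptions, not the Johnson-Neyman procedure itself: it evaluates the difference between two separately fitted regression lines at a single cut score, together with a large-sample standard error, which is the pointwise quantity from which a region of significance would be built. The data are simulated, the function names are hypothetical, and the CAT and P&P scores are assumed to be already expressed on a common scale.

```python
import numpy as np

def fit_line(x, y):
    """OLS fit of y on x; returns coefficients, residual variance, (X'X)^-1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)
    return beta, s2, xtx_inv

def predicted_difference_at(cut, x_cat, y_cat, x_pp, y_pp):
    """Predicted criterion (CAT minus P&P) at a cut score, with its standard error."""
    v = np.array([1.0, cut])
    b_cat, s2_cat, c_cat = fit_line(x_cat, y_cat)
    b_pp, s2_pp, c_pp = fit_line(x_pp, y_pp)
    diff = v @ b_cat - v @ b_pp
    # Separate error variances are allowed for the two groups.
    se = np.sqrt(s2_cat * (v @ c_cat @ v) + s2_pp * (v @ c_pp @ v))
    return diff, se

# Hypothetical data: 200 cases per group, equal slopes, intercepts 0.5 apart.
rng = np.random.default_rng(0)
x_cat = rng.normal(50, 10, 200)
x_pp = rng.normal(50, 10, 200)
y_cat = 1.0 + 0.05 * x_cat + rng.normal(0, 0.5, 200)
y_pp = 0.5 + 0.05 * x_pp + rng.normal(0, 0.5, 200)

diff, se = predicted_difference_at(40.0, x_cat, y_cat, x_pp, y_pp)
z = diff / se  # compare with a normal or t reference distribution
```

Repeating the calculation over a grid of cut scores would show where along the score scale the two prediction systems differ by more than sampling error.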

The schools and criterion measures to be used will need to be
determined on the basis of feasibility and importance of selection for the
various specialty schools. They should have some variety (e.g., auto
mechanic, clerk-typist, electronics, and infantry). It is recommended that
differential prediction studies be conducted for at least three schools,
and preferably many more. For each school there should be a minimum of 100
persons with CAT scores and another 100 or more with P&P scores. The P&P
scores should be obtained from a special administration rather than from
the files, in order to avoid differences due to motivational differences
between the special administration and the regular administration at time
of entry into the service. Final course grades, performance tests, and
attrition might all serve as criteria. For selected jobs, the Skills
Qualification Test results might also serve as criteria.

The use of only 100 cases in each group will be adequate to detect
gross differences in prediction equations and will be satisfactory as a
first step. If at all possible, 200 cases would be much better. Still,
small differences in regressions cannot be detected with fewer than 500
cases in each group, and subtle differences need even more cases, or a
different approach to aggregation. We recommend that attempts be made to
gather enough data for such comparisons, although we realize that this
could not happen before widespread use of the CAT battery.

V4. The extent of prediction bias should be assessed for important
subpopulations. Desirable.

It would be desirable to compare prediction systems based on samples from
important subpopulations. Of special interest is the possibility that a
prediction system based on results for men yields biased predictions for
women, or that one based on majority group results yields biased predictions
for Blacks or Hispanic persons. The major obstacle to investigating the
possibility that the CAT leads to biased predictions for members of
particular subpopulations will be sample size. For useful comparisons it
is desirable that CAT results and criterion results be available for
approximately 100 or more members of each subpopulation. If it is feasible
to obtain samples of this size for a particular school, then standard tests
for the homogeneity of error variances, slopes, and intercepts should be
conducted. If significant differences are obtained, then the direction and
amount of bias would need to be examined as a function of scores on the
CAT, which requires even more cases. Bias in prediction near the minimum
score for entry into a school would be of special concern.

Item Parameters - Estimation


The parameters of the item response curve for each item in the test
pool play a central role in adaptive testing. The choice of items to
present to each person and the score derived for each person depend
critically on the item parameters. Without good items and good estimates
of the IRT parameters, useful ability estimates will not be obtained,
regardless of the quality of the other components. The item parameter
estimates are used for item selection, for ability estimation, and to
compute test information. If the parameter estimates are poor, none of the
other procedures can give meaningful results. Therefore it is of utmost
importance that the calibration be done properly and that evidence be
presented to show the quality of the results.

The following sections will make recommendations concerning how
estimation, linking, equating, and item pool production should be done,
based on the best current information and judgment. New research in the
area may require changes.

Item calibration. The term item calibration is used here to mean the
estimation of IRT item parameters for each item in the item pool for a
test. These parameter estimates are usually obtained from one of several
calibration programs that are available. Following the background
discussion, it will be assumed that a is the discrimination or slope
parameter, b is the difficulty or threshold parameter, and c is the lower
asymptote or pseudo-guessing parameter. These parameters will be assumed
to be in normal ogive form. That is, if a logistic model is used, the
constant D = 1.7 is included in the model, as in the presentation given
above in the section labelled "Background."
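
For concreteness, the three-parameter logistic curve with the scaling constant D = 1.7 can be written as a short function. This is a minimal sketch using the standard form of the model; the function name is an assumption introduced here, and the parameter names follow the a, b, c convention just defined.

```python
import math

D = 1.7  # scaling constant that puts the logistic curve on the normal-ogive metric

def irc_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model.

    theta: ability, a: discrimination, b: difficulty, c: lower asymptote.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# At theta equal to b the probability is halfway between c and 1.
p_at_difficulty = irc_3pl(0.0, 1.0, 0.0, 0.2)
```

For very low abilities the curve approaches c rather than zero, which is why c is described as a pseudo-guessing parameter.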

The primary requirement in determining item parameters is having
enough cases to yield stable estimates. Although the sample size
requirements for the various calibration programs vary, the current
literature (Lord, 1968; Reckase, 1978; Ree, 1979, 1981) seems to indicate
that at least 1,000 cases are necessary for stable calibration. This is a
firm lower limit. A larger sample is desirable. Our general recommendation
follows.

IE1. The sample for item calibration should be of adequate size, currently
at least 1000 cases. Essential.

As a corollary, any new procedure for item calibration is likely to need
the same sample size. However, the requirement of 1000 cases is the result
of empirical test. Thus, when considering a new procedure, the sample size
requirements must be reevaluated using both simulation and live data
studies. For the simulation studies, samples of item response vectors
should be generated using the model selected as a basis for the CAT system,
for the test length to be used in item calibration, and using realistic
assumptions about error. Several different sample sizes should be produced
so the effect on calibration can be determined. These samples should then
be used to estimate the item parameters of the simulated items. These
estimated parameters can then be compared to the item parameters used to
generate the data to determine the adequacy of the sample size. Both
squared deviation and absolute deviation statistics have been used for the
comparison in the past. Another check that has not frequently been used in
the literature, but that we advocate, is to compare empirical and
theoretical item response curves. In estimating parameters, one also
estimates ability values for the persons, which then permits determining
the empirical curves for comparison.
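
The data-generation step of such a simulation study might look like the sketch below. It assumes the three-parameter logistic model with D = 1.7, a standard normal ability distribution, a hypothetical 40-item test, and invented parameter ranges; none of these values comes from the report. The calibration step itself is not shown, since it depends on the program under evaluation.

```python
import numpy as np

D = 1.7

def simulate_responses(theta, a, b, c, rng):
    """Draw a persons-by-items matrix of 0/1 responses from the 3PL model."""
    theta = np.asarray(theta, dtype=float)[:, None]   # column of abilities
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (rng.random(p.shape) < p).astype(int)

rng = np.random.default_rng(1982)

# Hypothetical "true" item parameters for a 40-item calibration test.
n_items = 40
a = rng.uniform(0.6, 2.0, n_items)
b = rng.normal(0.0, 1.0, n_items)
c = rng.uniform(0.1, 0.3, n_items)

# One simulated data set per candidate sample size.
datasets = {}
for n_persons in (250, 500, 1000, 2000):
    theta = rng.normal(0.0, 1.0, n_persons)
    datasets[n_persons] = simulate_responses(theta, a, b, c, rng)
```

Each matrix in `datasets` would then be passed to the calibration program, and the resulting estimates compared with `a`, `b`, and `c`.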

The live data studies can be performed by calibrating a test on a
large sample that is well beyond the sample size expected to be required
for accurate calibration, and using those results as a basis for evaluating
the quality of smaller sample calibrations. As with the simulation
procedure, a squared deviation or absolute deviation statistic can be used
to judge the similarity of the parameter estimates from the small sample
and large sample calibrations. The goal is to determine the point where an
increase in sample size does not produce any meaningful increase in
similarity. What is a meaningful increase is still a subjective judgment.

Both simulation and live data studies should be run to evaluate
calibration procedures because of the basic inadequacies inherent in each
type of study. Simulated data never accurately represent the many
extraneous sources of variation present in real data. Therefore,
simulations tend to give a better result than can be obtained from real
test data. By contrast, studies using real data have the problem of not
knowing the "true" parameters, so they lack a good criterion for accurate
calibration. Results from extremely large samples do not provide the
criteria, because they may be biased if a poor calibration procedure is
used. By using both simulated and real data, the weaknesses of each type
of study can be taken into consideration, resulting in an estimate of the
required sample size that can be accepted with greater confidence.

Merely having a large sample of examinees is not sufficient to ensure
that calibration results will be accurate. If the ability of the sample is
such that most examinees have a high probability of responding correctly to
the items (that is, the test is too easy), it will not be possible to
estimate two critical parameters of the item response curve. These
parameters are the ones dealing with the lower asymptote (guessing level,
c) and the slope at the point of inflection of the curve (discrimination,
a). In order to estimate these parameters, the sample must have sufficient
numbers of cases at the middle and bottom end of the ability range measured
by the items. Thus, a large sample that is positively skewed is more
desirable than one that is negatively skewed. If necessary, the tryout
sample should be specifically chosen to have sufficient cases in the middle
and lower ability ranges.

IE2. The calibration sample should be selected so that a sufficient number
of cases are available in the range of ability needed to estimate the lower
asymptote and the point of inflection of the IRC. Essential.

The statistical properties of the item calibration procedure should be
carefully evaluated. Since the selection of items and the estimation of
ability are both totally dependent on the accuracy of the item parameter
estimates, it is of critical importance that the estimates be shown to be
good approximations of the "true" parameters. From a statistical point of
view, there are several criteria for what is considered a good estimate.
The two criteria considered of importance here are consistency
and statistical unbiasedness. A consistent estimate is one for which the
expected value of the estimate will approach the true value as the sample
size increases, and its variance will approach zero. This criterion
ensures that large sample estimates are good estimates. An unbiased
estimate is one for which, at any sample size, the expected value of the
estimate equals the true value. A biased estimate can be consistent if
the bias gets smaller as the sample size increases, and tends to zero as
the sample size increases without bound. There is no question that the
estimates should be consistent, but there is some argument about bias.
Further, there is no theoretical proof at present that any of the methods
yields consistent estimators. Establishing consistency is at present an
empirical problem, so we use the term empirical consistency.

Bias may be less important. All Bayesian estimates are biased.
Such estimators may have smaller mean squared error than other methods, in
which case their use is justified. But when biased estimation is used,
the extent of the expected bias should be known.

These  considerations  lead  to  the  following  recommendations. 

IE3. The procedure for estimating item parameters should be shown to be
empirically consistent. Essential.

IE4. The procedure for estimating item parameters should be shown
empirically to be unbiased, or the extent and nature of the bias should be
specified. Essential.

Bias is a problem mainly in putting together estimates obtained from
different data sets. Such combination of estimates is required in
adaptive testing, because a large item pool must be calibrated for each
test. If equivalent samples of the same size are used in calibrating
different sets of items, the calibrations can be linked in a
straightforward way. But if the samples vary in sample size or in the
shape of the ability distribution, biases may differ, introducing extra
error in the linking process. The bias can also be troublesome if items
are recalibrated in an operational setting, which we recommend below, for
reasons presented there. Here the issue is that any procedure for
recalibrating the items will have to recognize the inherent bias in item
parameters if the item calibration procedure is biased.

We note that the issue of estimation bias is critical because some of
the prominent procedures for item calibration, including one due to Urry
(1981), use a Bayesian framework, which is inherently biased. This
statistical bias is not seen by Bayesians as a bad thing but as a
conservative thing, in the same way that ordinary least-squares regression
yields conservatively biased predictions. In essence, the estimates are
biased toward a prior distribution of ability, which is commonly specified
as normal. Although Urry's procedure does not include prior distributions
for item parameters, the net result is that the b parameter estimates tend
to be regressed toward the mean ability. The procedure of Swaminathan
(Note 5) and Reiser (Note 3) does include prior distributions on c and a.
So long as bias can be measured explicitly in all uses of the parameters,
the bias can be tolerated, but it does complicate the system.

Results from simulation studies can be used to determine if the
parameter estimation procedure yields empirically consistent and unbiased
estimates. Response data can be generated for a sample similar in size to
that available for live testing applications, using specified item
parameters and a known ability distribution. These data can then be
calibrated using the estimation procedure of interest and the parameter
estimates compared to the known, true parameters. This can most easily be
done by plotting one set against the other. If the resulting plot tends to
follow a 45 degree line, the estimates are unbiased. If the plotted points
cluster more closely around the line with increased sample size, the
estimates may be called empirically consistent. An alternative analysis is
to compute average squared deviation or absolute deviation statistics
between the true and estimated parameters to indicate their similarity.
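
The comparison of true and estimated parameters can be summarized numerically as well as graphically. The sketch below is one possible summary, with hypothetical names and made-up numbers: it reports the mean deviation (a simple index of bias), the root mean squared deviation, the mean absolute deviation, and the slope and intercept of the line relating estimates to true values, which should be near 1 and 0 if the plot follows the 45 degree line.

```python
import numpy as np

def recovery_summary(true_vals, est_vals):
    """Summarize how well estimated parameters recover the true ones."""
    true_vals = np.asarray(true_vals, dtype=float)
    est_vals = np.asarray(est_vals, dtype=float)
    dev = est_vals - true_vals
    # np.polyfit returns the highest-degree coefficient first.
    slope, intercept = np.polyfit(true_vals, est_vals, 1)
    return {
        "mean_deviation": dev.mean(),
        "rmsd": np.sqrt((dev ** 2).mean()),
        "mad": np.abs(dev).mean(),
        "slope": slope,
        "intercept": intercept,
    }

# Made-up example: difficulty estimates shrunk toward zero by 20 percent,
# as might happen when estimates are regressed toward a prior mean.
b_true = np.array([-1.0, 0.0, 1.0, 2.0])
b_est = 0.8 * b_true
summary = recovery_summary(b_true, b_est)
```

Computing the same summary at several sample sizes shows whether the deviations shrink as the sample grows, which is the sense of empirical consistency used here.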

It is recognized that the guessing level parameter, c, is not easily
estimated for easy items. There will be a need for items that are so easy
that even the lowest-scoring persons will have a moderate probability of
correctly answering the item. For those items, estimating c is very
difficult. Recent work by Swaminathan (Note 5) and Reiser (Note 3) on
Bayes-constrained estimation of the parameters has improved prospects for
stable estimation of the c parameters.

Simulation data alone cannot demonstrate the effectiveness of an item
calibration procedure. No matter how conscientiously produced, simulation
data do not have the same richness of variation as the responses of
individuals to test items. Therefore, it is important that the calibration
be shown to yield satisfactory results on real data as well as simulation
data. The procedure used to determine the quality of item calibration is
the comparison of empirical item response curves (IRC's) with the IRC's
based on the item parameter estimates. Empirical IRC's can be obtained by
dividing the ability scale into several intervals and determining the
proportion correct for each interval from the item data. A considerable
number of intervals should be used (we suggest 15-20) so that the variation
in ability within an interval is small enough to be ignored. Both the
empirical and estimated IRC's can then be plotted on the same axes for
comparison. A quantitative evaluation of these curves can be obtained by
using the chi-square statistic suggested by Yen (1981). Strictly speaking,
this statistic should be used only when the abilities are estimated from
other items, but it does give a means of judging the relative fit of the
estimated IRC's to the actual item data. Levine (Note 1) has also proposed
a method of assessing the fit of IRC's.
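
The comparison of empirical and model-based IRC's can be sketched as below. Several choices here are assumptions rather than the report's prescription: the intervals are formed with roughly equal numbers of examinees, the expected proportion in each interval is the mean model probability of the examinees in it, and the fit index is a Pearson-type chi-square in the spirit of the Yen (1981) statistic. The example uses simulated data and the generating abilities, whereas an application would use ability estimates.

```python
import numpy as np

D = 1.7

def irc_3pl(theta, a, b, c):
    """Model probability of a correct response (three-parameter logistic)."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def empirical_vs_model_irc(theta_hat, item_scores, a, b, c, n_intervals=15):
    """Tabulate observed and expected proportions correct by ability interval."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    item_scores = np.asarray(item_scores, dtype=float)
    order = np.argsort(theta_hat)
    cells = np.array_split(order, n_intervals)  # roughly equal-sized intervals
    rows, chi_square = [], 0.0
    for idx in cells:
        n = len(idx)
        observed = item_scores[idx].mean()
        expected = irc_3pl(theta_hat[idx], a, b, c).mean()
        chi_square += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
        rows.append((theta_hat[idx].mean(), n, observed, expected))
    return rows, chi_square

# Simulated illustration for one item with hypothetical parameters.
rng = np.random.default_rng(7)
theta = rng.normal(0.0, 1.0, 3000)
a_true, b_true, c_true = 1.2, 0.0, 0.2
scores = (rng.random(3000) < irc_3pl(theta, a_true, b_true, c_true)).astype(int)

rows, q_fit = empirical_vs_model_irc(theta, scores, a_true, b_true, c_true)
# Deliberately wrong difficulty, to show how misfit inflates the statistic.
_, q_misfit = empirical_vs_model_irc(theta, scores, a_true, b_true + 1.0, c_true)
```

The `rows` list holds the points needed to plot the empirical curve against the model curve on the same axes.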

IE5. The IRC's defined by the estimated item parameters should fit the
observed data. Essential.

It will probably be necessary to do the initial item calibration with data
obtained from P&P administrations. It is possible that the characteristics
of items are different in the P&P and the CAT formats. A study is suggested
below, in the discussion of human factors, to examine this issue. If there
should be an effect of mode of presentation, this effect will have to be
taken into consideration when equating the CAT scale with the current P&P
ASVAB scale.

When CAT becomes operational, or as early as may be, a check should be
made of the item parameters in the operational context. This may involve a
re-estimation of item parameters, and at least should involve a comparison
of the IRC specified by the item parameters and the empirical IRC. To do
this involves inserting each item into the series for a group of examinees,
since if we rely on the normally accumulated responses for the item, the
upper and lower portions of the curve will be poorly estimated.

IE6. The operational CAT system must be able to include a specified item
in a test sequence without scoring it, on a flexible predetermined
schedule. Essential.

When this has been done for all or most of the items in the pool, the data
should be examined to determine if some adjustment in item parameters is
necessary. Here the question of test score calibration is paramount.

IE7. Items that include diagrams should be recalibrated either on
prototype equipment or on the operational equipment. Essential.

There will be enough uncertainty with the tests containing diagrams that
such items should be recalibrated on the actual equipment. Some tests use
items with diagrams, as discussed below under human factors. Difficulty of
the item may be altered by the legibility of the diagrams.

IE8. As soon as possible a study must be done to compare the difficulty
parameters of items given in the standard paper and pencil mode with the
same items given by the computer. Essential and urgent.

One overall effect of computer presentation may be to change the
difficulty of items on the power tests. The effect on the speeded tests is
certain, as noted elsewhere. If the effect on the power tests is
significant, it will have implications for the plan to calibrate the item
pools by P&P methods. If the effect is constant across items, it will not
be noticed, since the item calibration is relative to an ability
distribution with a specified mean and variance. A constant effect would,
however, cause trouble in equating with previous tests, so the actual size
of the effect must be determined. But if some items are affected more than
others, item parameters determined from the P&P mode are open to question.
We do not expect a differential effect, except possibly for items with
diagrams. But an empirical determination of the presence or absence of a
differential effect is necessary.

The experiment should be done first with power tests that do not
include diagrams, using experimental or prototype equipment. When
appropriate equipment becomes available, the tests with diagrams should be
examined.

The experiment should compare the two modes of test presentation in
the context of a standard test, with the computer not in an adaptive mode,
because data obtained from administering an adaptive test cannot readily be
used to estimate item parameters. In an adaptive test, each item is
administered to a different set of persons, usually those whose probability
of giving a correct response is neither very small nor very large.
Comparability of results in the proposed experiment requires that the
people who attempt each item in the two conditions have the same ability
distributions.


Note that any single item can be calibrated in an adaptive setting by
administering it to all persons, independently of their ability, randomly
inserting the item into the sequence of administered items. But this
procedure would require large numbers of examinees in the present study,
since a different group would be needed for each item.

The experiment should balance order of presentation for the special
battery of power tests being examined. All tests of the speed battery
should be given together in one mode, and then in the other mode. (It is
conceivable that the adaptive test would in itself sufficiently change the
difficulty, but this seems most unlikely. The change due to mode of
presentation is also unlikely, but is at least conceivable.) Many
experimental designs are possible. In one, which seems to be the simplest,
each test in the battery would be prepared in both P&P mode and computer
presentation mode. Subjects would be randomly assigned (this is vitally
important) to one of two groups. Group I would take the battery in P&P
mode. Group II would take the battery in computer presentation. An
analysis of variance would be run on the parameters themselves, using the b
values directly, but using log of a and logit of c. These transformations
will make the data appropriate for the linear analysis of variance model.
Notice that sample size need not be 1000, as in ordinary parameter
estimation, because there is no intent of using the parameters for
individual items. The issue is whether there are main effects and
interactions for the set of items. Probably samples of 200 in each group
would suffice.
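
The transformations and the test of a mode main effect can be sketched as follows. This is an illustration under assumptions: the parameter values are invented, the function names are hypothetical, and only the main effect of mode is shown. With one estimate per item in each mode, the analysis of variance F test for mode against the item-by-mode interaction is equivalent to the square of the paired t statistic computed here.

```python
import numpy as np

def transform_parameters(a, b, c):
    """Return (log a, b, logit c), the scales suggested for the linear model."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    c = np.asarray(c, dtype=float)
    return np.log(a), b, np.log(c / (1.0 - c))

def mode_effect(values_pp, values_computer):
    """Mean computer-minus-P&P shift across items, its paired t, and its df."""
    d = np.asarray(values_computer, dtype=float) - np.asarray(values_pp, dtype=float)
    n = len(d)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return d.mean(), t, n - 1

# Invented difficulty estimates for four items calibrated in each mode.
b_pp = [0.0, 0.5, -0.5, 1.0]
b_computer = [0.2, 0.6, -0.2, 1.3]
mean_shift, t_stat, df = mode_effect(b_pp, b_computer)

# The same function would be applied to log a and logit c.
log_a, b_same, logit_c = transform_parameters([1.0, 1.5], [0.0, 0.5], [0.5, 0.2])
```

A large spread in the item-level differences, relative to their sampling error, would point to the differential effect that the text singles out as the more serious possibility.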

The main question is whether there is a main effect on item difficulty
and whether item difficulty interacts with test mode. The intercorrelation
of the two test modes is a secondary aspect of this question. It would be
best to fit IRC's to these data, but the main questions can be answered by
standard item analysis.

Item Parameters - Linking

Linking is the process of putting the results of separate calibrations
on the same scale. Unless large numbers of items can be administered to
many individuals, linking is necessary for the formation of large item
pools. The usual procedure in forming large item pools is to administer
many short tests to many different groups of individuals. The results of
the separate calibrations of each test are then linked together to form one
large set of calibration data. Since the parameter estimates actually used
in the CAT procedure are those determined from the linking, the quality of
these estimates is critical. Even good calibration results can be ruined
by poor linking.

In a recent study of linking procedures (Vale et al., 1981), four
different types of linking designs were considered. Two of the designs
depended on sampling. In the equivalent-groups procedure, different subsets
of items are given to different random samples of the population of
test-takers. Each set of items is calibrated separately, and the results
rescaled so that the mean and standard deviation of the ability scores of
the two groups are equated, on the assumption that the groups are
equivalent.
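
The rescaling step of the equivalent-groups procedure amounts to a linear change of the ability scale, which carries over to the item parameters in a known way. The sketch below is a minimal illustration with hypothetical function names and made-up numbers: if abilities are transformed as theta* = A theta + B, then difficulties transform as b* = A b + B, discriminations as a* = a / A, and the lower asymptotes are unchanged.

```python
import numpy as np

def equivalent_groups_constants(theta_new, theta_base):
    """Slope A and intercept B that give the new group the base group's mean and SD."""
    theta_new = np.asarray(theta_new, dtype=float)
    theta_base = np.asarray(theta_base, dtype=float)
    A = theta_base.std(ddof=1) / theta_new.std(ddof=1)
    B = theta_base.mean() - A * theta_new.mean()
    return A, B

def rescale_items(a, b, c, A, B):
    """Express item parameters on the base scale; c is unaffected by the rescaling."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    c = np.asarray(c, dtype=float)
    return a / A, A * b + B, c

# Made-up ability estimates from two separately calibrated random groups.
theta_new = [-1.0, 0.0, 1.0]
theta_base = [-1.5, 0.5, 2.5]
A, B = equivalent_groups_constants(theta_new, theta_base)

a_star, b_star, c_star = rescale_items([1.0, 2.0], [0.0, 1.0], [0.2, 0.25], A, B)
```

Because the product a(theta - b) is unchanged by this transformation, the response probabilities implied by the rescaled parameters are the same as before; only the scale on which they are expressed differs.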

In the equivalent-tests method, subsets are determined by a random
process, and are given to different groups. It is assumed that the process
results in parallel tests. We doubt the wisdom of such an assumption,
since small samples of items are involved. Since Vale et al. found this
method inferior, it is not recommended.

Vale et al. also consider what they call the anchor-group method, in
which one group of persons takes all the items. Since the point of linking
is to avoid such a requirement, this method is not recommended.

The other viable methods involve overlapping sets of items. The most
common such design is the anchor-test method, in which one subset of items
is taken by all persons and provides the base for linking the remaining
items. This design is sound but is especially pertinent to equating
successive forms of a test in a testing program like the College Board
series. An extension of this design has each test sharing some items with
some other tests. (See McKinley & Reckase, 1981b.) The optimal design for
use in this method is a balanced incomplete block design in which the
groups of individuals define the blocks and the items are the treatments.
Each test would be calibrated separately for each group, and the parameter
estimates would be used as the dependent variable in an analysis of
variance to determine the transformation (treatment effect) required to
place them all on the same scale. If the equivalent groups method is used
with parameter estimates from a three-parameter model, the b-values can be
used as is, but the log of the a-values and the logit of the c-values
should be used, as suggested above. As an alternative, a program that
accepts a not-reached code (such as LOGIST or BILOG) can be used to
estimate all items simultaneously.
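
As an illustration of what placing two calibrations on one scale involves,
the following is a minimal sketch of the mean/sigma method applied to the
difficulty estimates of common items. This is a simpler stand-in, not the
analysis-of-variance procedure described above, and the function names and
data are hypothetical. It assumes the two ability scales differ by a linear
transformation, theta* = A * theta + B, so that b* = A * b + B, a* = a / A
(an additive shift on the log scale), and c is unchanged.

```python
from statistics import mean, pstdev


def mean_sigma_link(b_source, b_target):
    """Return (A, B) such that b_target is approximately A * b_source + B.

    Both lists hold difficulty estimates for the SAME common items,
    obtained from two separate calibrations.
    """
    A = pstdev(b_target) / pstdev(b_source)
    B = mean(b_target) - A * mean(b_source)
    return A, B


def rescale_item(a, b, c, A, B):
    """Move one item's (a, b, c) estimates from the source to the target scale."""
    return a / A, A * b + B, c


# Hypothetical difficulty estimates for four common items.
b_in_test2 = [-1.0, 0.0, 0.5, 1.5]
b_in_test1 = [2.0 * b + 0.3 for b in b_in_test2]  # same items, other scale

A, B = mean_sigma_link(b_in_test2, b_in_test1)
print(A, B)
print(rescale_item(1.2, 0.5, 0.2, A, B))
```

With real estimates the common items will not fall exactly on a line, and
the recovered A and B are only as good as the common-item estimates.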
IL1. When using a common item procedure to link calibrations together, the
parameters must be shown to be on the same scale. Essential.

The procedure recommended to show the quality of the linking procedure is
to form a circular chain of linked tests with the first test eventually
linked to the last test. That is, Test 1 should be linked to Test 2, Test
2 to Test 3, Test 3 to Test 4, and so on up to Test N-1 to Test N, where
Test N is the same as Test 1. Thus the initial parameter estimates for
Test 1 can be compared to the linked parameter estimates for Test 1
(Test N) to determine their similarity. This procedure can be performed
using data from the balanced incomplete block design described above.
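
The circular-chain check can be sketched as follows, under the assumption
(not stated in the text) that each link is a linear transformation
theta -> A * theta + B. Composing the links around the chain should return
approximately the identity transformation; any departure is accumulated
linking error. The link values and names below are hypothetical.

```python
def compose(links):
    """Compose linear scale links, applied in the order given."""
    slope, intercept = 1.0, 0.0
    for A, B in links:
        slope, intercept = A * slope, A * intercept + B
    return slope, intercept


# Hypothetical links: Test 1 -> Test 2, Test 2 -> Test 3, Test 3 -> Test 1.
chain = [(1.25, -0.40), (0.80, 0.50), (1.02, -0.15)]
slope, intercept = compose(chain)

# Initial Test 1 difficulties versus the same items carried around the chain.
b_initial = [-1.0, 0.0, 1.0]
b_linked = [slope * b + intercept for b in b_initial]
mean_abs_dev = sum(abs(u - v) for u, v in zip(b_initial, b_linked)) / len(b_initial)
print(slope, intercept, mean_abs_dev)
```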

IL2.  The  linking  procedure  used  should  be  fully  described.  Essential. 

IL3.  The  similarity  of  initial  and  linked  estimates  should  be  presented. 
Essential. 

Correlations are inappropriate for this measure of similarity. The
parameters could be on quite different scales and still give high
correlations. Some type of deviation statistic such as the average squared
or absolute deviation would be much more appropriate.
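
A small numerical illustration of this point, with hypothetical difficulty
estimates: the linked values below are on a clearly different scale, yet
the correlation with the initial values is perfect, while the deviation
statistics expose the discrepancy.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / sqrt(sxx * syy)


def rms_deviation(x, y):
    """Root mean squared deviation between paired estimates."""
    return sqrt(sum((u - v) ** 2 for u, v in zip(x, y)) / len(x))


def mean_abs_deviation(x, y):
    """Average absolute deviation between paired estimates."""
    return sum(abs(u - v) for u, v in zip(x, y)) / len(x)


initial_b = [-1.5, -0.5, 0.0, 0.8, 1.6]
linked_b = [1.3 * b + 0.5 for b in initial_b]  # mis-scaled on purpose

print(pearson(initial_b, linked_b))
print(rms_deviation(initial_b, linked_b))
print(mean_abs_deviation(initial_b, linked_b))
```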

IL4. When using an equivalent group procedure to get the parameter
estimates on the same scale, the groups used must be shown to be
equivalent. Essential.

The  equivalent  group  procedure  is  totally  dependent  on  the  similarity  of 
each  of  the  groups  used  for  calibration.  The  sampling  plan  for  obtaining 
equivalent  groups  is  critical. 

IL5. The methods for sampling the individuals for the groups should be
described in detail. Essential.

IL6.  Descriptive  statistics  showing  the  equivalence  of  the  groups  should 
be  reported.  Essential. 

The means and standard deviations alone are not enough to show the
similarity of groups for IRT linking. The distributions of scores must be
shown to be similar. This can be done by reporting coefficients of
skewness and kurtosis or by graphically comparing distributions.
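
The descriptive statistics called for here can be computed from central
moments. The sketch below uses the simple moment definitions (skewness
m3 / m2^1.5 and kurtosis m4 / m2^2, which is 3 for a normal distribution);
other conventions exist, and the scores are hypothetical.

```python
def shape_stats(scores):
    """Mean, SD, skewness, and kurtosis from population central moments."""
    n = len(scores)
    m = sum(scores) / n
    m2 = sum((x - m) ** 2 for x in scores) / n
    m3 = sum((x - m) ** 3 for x in scores) / n
    m4 = sum((x - m) ** 4 for x in scores) / n
    return {
        "mean": m,
        "sd": m2 ** 0.5,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


group_a = [-2, -1, 0, 1, 2]    # symmetric scores
group_b = [-1, -1, -1, -1, 4]  # same mean, strongly skewed

stats_a = shape_stats(group_a)
stats_b = shape_stats(group_b)
print(stats_a)
print(stats_b)
```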
Item Pool Characteristics.

Item pool characteristics include the placement and quality of items
along the ability scale. Regardless of the quality of the calibration and
the linking, if good items are not present in the region of the ability
scale of interest, good ability estimates will not be obtained. This fact
is reflected in the size of the standard error of the ability estimates or
in the number of items needed in order to achieve a specified error
criterion.

IP1. The distribution of the a-parameter estimates and descriptive
statistics for the estimates should be presented. Very desirable.

IP2. The distribution of the b-parameter estimates and descriptive
statistics for the estimates should be presented. Very desirable.

Special mention should be made of any gaps in the item pool.

IP3. The distribution of the c-parameter estimates and descriptive
statistics for the estimates should be presented. Very desirable.

IP4.  The  information  function  for  the  total  item  pool  should  be  presented. 
Very  desirable. 

The information function will show where the item pool has adequate numbers
of items and where few or low quality items are present. It will also
indicate the range of ability that can be measured by the pool.

IP5. The anticipated ability distribution should be plotted on the same
scale with the information function. Essential.

Even if the item pool is of good quality, it will not result in good
measurement unless it matches the ability of the population of interest.
For example, if a large number of difficult items are administered to a low
ability group, poor measurement will result regardless of the quality of
the items. One easy way to check whether the items match the ability of the
groups is to plot the ability distribution of the examinee population on
the same graph with the item pool information function. The two plots
should overlap for the majority of their range. If no empirical
distribution is available, a normal distribution with mean and standard
deviation estimated from past testing can be used.
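
The comparison described above can be tabulated before it is plotted. The
sketch below assumes the three-parameter logistic model with scaling
constant D = 1.7 and the usual expression for item information under that
model; the pool, the normal ability distribution, and all names are
hypothetical.

```python
from math import exp, pi, sqrt

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


def item_info(theta, a, b, c):
    """Item information at theta for a 3PL item."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def pool_info(theta, pool):
    """Information function of the whole pool: sum of item informations."""
    return sum(item_info(theta, a, b, c) for a, b, c in pool)


def normal_density(theta, mu=0.0, sigma=1.0):
    """Anticipated ability density, assumed normal."""
    z = (theta - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2.0 * pi))


# Hypothetical pool of (a, b, c) estimates.
pool = [(1.0, -1.0, 0.20), (1.2, 0.0, 0.20), (0.8, 1.0, 0.25), (1.5, 2.0, 0.20)]

for k in range(-6, 7):
    theta = k / 2
    print(f"{theta:5.1f}  info={pool_info(theta, pool):6.3f}  "
          f"density={normal_density(theta):.3f}")
```

Regions where the density is appreciable but the pool information is low
are the gaps that deserve special mention.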


Item  Selection  and  Test  Scoring 

Several different methods might be used to select successive items in
adaptive testing. The up-and-down method of stochastic approximation (see
Lord, 1970) adjusts the difficulty of the next item either up or down a
fixed amount, called the step size. The Robbins-Monro procedure (see Lord,
1971, and Sampson, 1976) provides a method for decreasing the step size as
the testing progresses. Neither of these methods is advocated now, because
more powerful procedures are computationally practical.

Three distinct methods are presently available for computing
provisional estimates of ability and selecting items for sequential
testing: (1) the Bayes updating method proposed by Owen (1969, 1975), (2)
the maximum information method discussed by Lord (1977) and Samejima
(1977a, b), and (3) a finite Bayes method recently proposed by Bock and
Aitkin (1981). The principles, advantages, and disadvantages of each are
discussed in this section.

1. Bayes updating. Although informal trials of adaptive sequential item
testing had been carried out earlier (Linn, Rock & Cleary, 1969), the first
statistically motivated proposal was that of Owen (1969, 1975). He gave a
Bayes updating rule based on the posterior mean and variance, given the
subject's response to one item. On the assumption that the posterior
distribution is approximately normal (strictly speaking it cannot be
exactly normal even when the prior is normal), Owen's result can be applied
recursively to estimate the mean and variance of the posterior distribution
after any number of successive item responses. The mean is then the
estimate of the subject's latent ability and the variance, the estimated
measurement error.

The item selection rule is to choose the item that will most reduce
the posterior variance. That item proves to be the one with the highest
discriminating power among those whose difficulty is in the neighborhood
of the prior mean. The process is repeated until the stopping criterion is
reached.

Owen (1975) proved that this rule almost certainly converges to the
value of the trait, and Wood (1971) demonstrated its properties in
application to real and simulated subjects. Using a 1000-item pool of
vocabulary items, Wood successfully estimated vocabulary knowledge of high,
medium and low ability 4th, 5th, and 6th graders with uniformly good
precision with about 25 items in most cases. He found, in fact, that
better precision than with conventional tests was often attained with
twenty items, the gains in precision being small after that point. McBride
(1977) also studied Owen's strategy.

The equations of Owen's method are relatively simple because of the
use of a normal prior distribution and normal item characteristic curves,
and because of the simplifying assumption that the posterior distribution
of ability is also normal.


Recently, Bock and Aitkin (1981) have shown that a straightforward
Bayes method can be used, without the simplifying assumption, because the
necessary, and apparently complicated, calculations are, in fact, quite
simple to do by numerical methods of integration.
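
To indicate how little computation such numerical integration requires,
the following is a minimal sketch of a posterior mean and variance computed
on an equally spaced grid. It is not the Bock and Aitkin implementation: it
assumes a standard normal prior, a three-parameter logistic response
function with D = 1.7, and simple rectangular quadrature, and all names are
hypothetical.

```python
from math import exp

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


def posterior_mean_var(responses, n_points=81, lo=-4.0, hi=4.0):
    """Posterior mean and variance of ability by rectangular quadrature.

    responses is a list of (x, a, b, c) with x = 1 (correct) or 0.
    The prior is standard normal (unnormalized; the constant cancels).
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = []
    for t in grid:
        w = exp(-0.5 * t * t)
        for x, a, b, c in responses:
            p = p3pl(t, a, b, c)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    m = sum(t * w for t, w in zip(grid, weights)) / total
    v = sum((t - m) ** 2 * w for t, w in zip(grid, weights)) / total
    return m, v


prior_mean, prior_var = posterior_mean_var([])
one_mean, one_var = posterior_mean_var([(1, 1.0, 0.0, 0.2)])
all_mean, all_var = posterior_mean_var([(1, 1.0, 0.0, 0.2), (1, 1.2, 0.5, 0.2)])
print(prior_mean, prior_var)
print(one_mean, one_var)
print(all_mean, all_var)
```

Because the prior is proper, the estimate remains finite even when every
response so far is correct.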

2. Maximum information. The result from item response theory most crucial
to adaptive testing is the provision for "item-invariant" estimation of
ability. Estimates on a common scale can be computed for different
examinees for different subsets of items from a calibrated item pool.
Thus, items that are optimally informative can be selected for each
examinee without affecting the comparisons between examinees. In the class
of item-invariant estimators, the maximum likelihood estimator proposed by
Birnbaum (1968) has been most intensely investigated for use in adaptive
testing. Lord (1977b), Samejima (1977a), and others have used Monte Carlo
methods to examine the properties of the maximum likelihood estimator in
this role.

Briefly, the maximum likelihood estimate, $\hat{\theta}$, of an ability
$\theta$, given the item scores $x_j$ ($x_j = 1$ if correct, 0 if
incorrect) for $j = 1, 2, \ldots, n$, is the solution of the likelihood
equation

\[
\sum_{j=1}^{n} \frac{x_j - G_j(\theta)}{G_j(\theta)\,[1 - G_j(\theta)]}
\, \frac{\partial G_j(\theta)}{\partial \theta} = 0 ,
\]

where $G_j(\theta)$ is a general item response function and conditional
independence given $\theta$ is assumed.

If this equation has a finite solution, it can usually be found
efficiently by Newton-Raphson iterations, or that failing, less efficiently
by direct line search. The cases where the likelihood equation does not
have a finite solution are discussed below.
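
A minimal sketch of the iterative solution follows. It uses Fisher scoring,
a close relative of Newton-Raphson in which the expected rather than the
observed second derivative is used (the two coincide when the guessing
parameter is zero). The three-parameter logistic response function with
D = 1.7, the step cap, and all names are assumptions made for illustration.

```python
from math import exp

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


def dp3pl(theta, a, b, c):
    """Derivative of the 3PL response function with respect to theta."""
    p = p3pl(theta, a, b, c)
    return D * a * (1.0 - p) * (p - c) / (1.0 - c)


def ml_estimate(responses, theta=0.0, tol=1e-6, max_iter=50):
    """Solve the likelihood equation by Fisher scoring.

    responses is a list of (x, a, b, c). Returns None when no finite
    estimate exists (all correct or all incorrect) or on non-convergence.
    """
    scores = [x for x, _, _, _ in responses]
    if all(s == 1 for s in scores) or all(s == 0 for s in scores):
        return None
    for _ in range(max_iter):
        score = info = 0.0
        for x, a, b, c in responses:
            p = p3pl(theta, a, b, c)
            dp = dp3pl(theta, a, b, c)
            score += (x - p) * dp / (p * (1.0 - p))  # likelihood equation term
            info += dp * dp / (p * (1.0 - p))        # expected information
        step = max(-1.0, min(1.0, score / info))     # cap the step size
        theta += step
        if abs(step) < tol:
            return theta
    return None


# Easy item answered correctly, hard item missed; by symmetry the
# solution lies midway between the two difficulties.
pattern = [(1, 1.0, -1.0, 0.0), (0, 1.0, 1.0, 0.0)]
estimate = ml_estimate(pattern, theta=0.7)
print(estimate)
print(ml_estimate([(1, 1.0, -1.0, 0.0), (1, 1.0, 1.0, 0.0)]))
```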

The limiting variance of the maximum likelihood estimator with respect
to sampling of items from an infinite pool is given by the reciprocal of
the test information function, as discussed above in the background
material. Assuming the items carry some information about $\theta$, it is
apparent that the measurement error will be minimized if each item is
selected to have maximum information at $\theta$. The error variance of the
maximum likelihood estimator will decrease as items are added and
eventually the stopping criterion will be reached.

The main limitation of maximum likelihood-maximum information
procedures in adaptive testing is that a finite estimate of ability does
not exist when all of the examinee's responses are correct or all
incorrect, or when the guessing model is used and certain unfavorable
answer patterns occur (Samejima, 1973). The maximum information procedure
cannot begin, therefore, until at least one correct and one incorrect
response has occurred, and with the guessing model there is a finite
probability that it will fail at other times. In either case, some ad hoc
rule must be adopted to keep the procedure on track. Samejima (1981), for
example, has proposed a procedure for attributing an ability when responses
are all correct or all incorrect. Another possibility is the "biweighted"
maximum likelihood estimator proposed by Mislevy and Bock (1982). By
multiplying each term in the likelihood equation by a Mosteller-Tukey
(1977) biweight, one may suppress the effects of chance or other spurious
responses to items early in the sequence when the provisional estimate of
ability is poor. This improves the robustness of the estimator against
unfavorable answer patterns.

An important feature of both major types of item selection strategies
is that they continually revise the estimate of the ability, $\theta$, at
every step. Thus an estimate of ability is an inherent part of the item
selection process.

Recommendations. The value and utility of item selection and scoring
ultimately rest on the degree of precision and efficiency obtained.
Evaluation of reliability and precision has been discussed above. Here we
consider efficiency, and other ancillary issues.

IS1. The procedure for item selection and ability estimation must be
documented explicitly and in detail. Essential.

IS2. The procedure should include a method of varying the items selected,
to avoid using a few items exclusively. Essential.

IS3. The procedures used should include a mechanism to maintain a rough
balance of correct answer options. Desirable.

Several algorithms might be used to select the next item for a candidate,
conditional upon his previous responses. If the algorithm selects the most
informative next item, then only the most discriminating items, that is,
the items with the highest values of $a_j$, will be selected frequently,
whereas items that are very nearly but not quite as good will seldom be
selected.

Whatever algorithm is used for item selection, we recommend listing
the most informative items, perhaps the ten best, perhaps all whose
information is at least 90% of the information one would get from the best
item. Then a random selection would be made from that set of nearly
equivalent items. (Or, the selection could be weighted in favor of items
that had not been used as much.)
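
The 90% rule can be sketched as follows. The three-parameter logistic
information formula with D = 1.7, the dictionary layout of the pool, and
the function names are assumptions for illustration only.

```python
import random
from math import exp

D = 1.7  # assumed logistic scaling constant


def item_info(theta, a, b, c):
    """Item information at theta for a 3PL logistic item."""
    p = c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def candidate_set(theta, pool, used, fraction=0.90):
    """Unused items whose information is at least fraction of the best."""
    infos = {name: item_info(theta, *params)
             for name, params in pool.items() if name not in used}
    best = max(infos.values())
    return [name for name, value in infos.items() if value >= fraction * best]


def select_item(theta, pool, used, rng=random):
    """Pick at random among the nearly equivalent most informative items."""
    return rng.choice(candidate_set(theta, pool, used))


# Hypothetical pool: name -> (a, b, c).
pool = {
    "A": (1.5, 0.00, 0.2),
    "B": (1.5, 0.05, 0.2),
    "C": (0.6, 0.00, 0.2),
    "D": (1.5, 2.50, 0.2),
}

print(sorted(candidate_set(0.0, pool, used=set())))
print(select_item(0.0, pool, used={"A"}))
```

Weighting the random choice by how rarely each candidate has been used
would be a small extension of select_item.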

Item  selection  is  relevant  to  another  problem.  On  a  standard  test  it 
is  necessary  to  randomize,  or  at  least  mix  up,  the  answer  option  that  is 
the  correct  answer  to  the  questions.  It  is  not  acceptable  to  have  (c)  be 
the  correct  answer  most  of  the  time.  In  the  computer  presentation  mode, 
this  is  less  of  a  problem  because  the  test  taker  does  not  have  the  record 
of  his  past  responses  before  him.  Still,  some  mixing  is  necessary. 


The effect of answer option distribution is subject to experimental
study. Perhaps less able candidates favor the options encountered first (a
or b). Probably there is very little chance that anyone will encounter a
string of problems where the correct answer option is the same, and there
may be essentially no chance at all that an examinee would notice the
pattern. Recall that the examinee is getting about 1/3 of the items wrong!
Still, this is an aspect of adaptive tests that should be studied.

On a tailored test the pattern of correct options is unique to each
test taker. Probably the computer system should keep track of previous
correct answer options for this respondent on this test. The infrequently
used options can be used to influence the choice among nearly equivalent
items at each step.
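
One hedged way to let infrequently used options influence the choice is
sketched below: among nearly equivalent candidate items, prefer the one
whose keyed option has so far been the correct answer least often for this
examinee. The item names, answer key, and tie-breaking by list order are
all hypothetical.

```python
from collections import Counter


def pick_balanced(candidates, answer_key, keyed_so_far):
    """Choose the candidate whose correct option has been keyed least often."""
    return min(candidates, key=lambda item: keyed_so_far[answer_key[item]])


# Correct option for each candidate item (hypothetical).
answer_key = {"item17": "c", "item42": "a", "item58": "c"}

# How often each option has been the correct one so far in this test.
keyed_so_far = Counter({"a": 1, "b": 2, "c": 4, "d": 2})

choice = pick_balanced(["item17", "item42", "item58"], answer_key, keyed_so_far)
keyed_so_far[answer_key[choice]] += 1
print(choice, dict(keyed_so_far))
```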

Obviously, the item pool for a test should position the correct
answers with equal frequency over the options. Since the sets of equally
informative items from which item selection is made will be sets of items
of very similar difficulty, the correct answer options should be balanced
by difficulty level within each pool. Some experience will show whether
special pains should be taken to keep the answer options in balance for
each test-taker.

In principle, answer options could be rearranged for an item when it
is presented, provided that there is no natural order for the options. But
we know so little about the processes of answering items that even that
slight manipulation might be dangerous.


IS4. The computer algorithm must be capable of administering designated
items, and recording the response separately, without interfering with the
adaptive process. Essential.


The item selection program will have to permit administering items that are
being pretested or items that are being recalibrated, in the course of a
regular test. The computer programs must be able to handle this
possibility.


Predicting a good starting value. After the first test in the battery
has been administered, additional efficiency could be gained by using a
regression estimate of the examinee's ability on each subsequent test as a
starting place for the tailoring process. This procedure has been used
with good results by Maurelli & Weiss (1981). If this scheme is used, then
the order of tests in the battery becomes important. At least, the first
test should be the test having the highest correlation with all the others,
that is, the test closest to the first principal component of the battery.
(That test is probably Word Knowledge, or Science Information.) The first
test might well be tested to a slightly more stringent accuracy criterion
(or using slightly more items) than the other tests, if it is given the
added role of predicting starting values for subsequent tests.
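
In its simplest form the regression estimate is an ordinary linear
prediction from the first test. The sketch below assumes a single
predictor and known population means, standard deviations, and correlation;
the numerical values are hypothetical.

```python
def starting_value(theta_first, r, mean_first=0.0, sd_first=1.0,
                   mean_next=0.0, sd_next=1.0):
    """Regression estimate of ability on the next test, given the first.

    r is the correlation between the two tests' ability scales.
    """
    return mean_next + r * (sd_next / sd_first) * (theta_first - mean_first)


# An examinee 1.5 SD above the mean on the first test, with r = 0.6,
# would start the next test at about 0.9 SD above the mean.
start = starting_value(1.5, r=0.6)
print(start)
```

The prediction is pulled toward the mean in proportion to 1 - r, so a
weakly correlated first test moves the starting point very little.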


IS5. The computer system must be able to base the choice of a first item
on prior information. Essential.


The possibility has been considered of choosing a starting item for
the first test on the basis of external information such as number of years
of formal schooling. We view this as unsound. First, this might be unfair
to certain ethnic subgroups. Second, the test is intended to provide
independent information on ability. To use ancillary information would
disrupt the independence being assumed in the general prediction and
counselling situation.


Stopping  Rules 


One of the most attractive properties of adaptive testing is the
possibility of having a constant measurement error variance at all levels
of ability. This not only simplifies discussion of the reliability of the
test (q.v.), but it also satisfies better the assumption of homogeneity of
variance in subsequent test analysis. In regression, homogeneity of
measurement error variance is a necessary condition for homogeneity of
residuals. It can be attained in adaptive testing if an information
criterion (or posterior variance criterion) is used to terminate the item
presentations. If, for example, testing is continued until the error
variance is 1/10 of the population variance of ability, the adaptive
procedure will have a uniform reliability of 1/(1 + 1/10) = .91, which
would be acceptable in most testing applications. When the adaptive
test is replacing a conventional test, the criterion should be the error
variance of the latter at the ability level where it is most reliable. If
the adaptive test is continued until the measurement error attains this
value, the adaptive test will always be as reliable as or more reliable
than the conventional test regardless of the population in which it is
applied.

A simpler stopping rule is always to use the same number of items.
With this procedure, the ability of some persons will be estimated more
accurately than others. On the average, very low and very high ability
levels will not be estimated as well as the middle levels even when the
first item is selected on the basis of other relevant performance.

One disadvantage of stopping at a fixed level of measurement error
variance is that some persons may need many more items than others, which
may cause some operational difficulties. A hybrid rule might be adopted in
which testing is stopped when an acceptable level of measurement error is
reached, or when a certain number of items is given, whichever happens
first.
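
The hybrid rule, and the reliability implied by a given error criterion,
can be written down directly. The sketch assumes the criterion is stated as
an error variance on a scale where the population variance of ability is
known; the thresholds are hypothetical.

```python
def reliability(error_variance, population_variance=1.0):
    """Reliability implied by a constant measurement error variance."""
    return population_variance / (population_variance + error_variance)


def should_stop(error_variance, n_items, target_variance, max_items):
    """Hybrid rule: stop at the error criterion or the item limit,
    whichever happens first."""
    return error_variance <= target_variance or n_items >= max_items


print(round(reliability(0.1), 2))        # error variance 1/10 of population
print(should_stop(0.08, 12, 0.10, 30))   # criterion met after 12 items
print(should_stop(0.20, 12, 0.10, 30))   # keep testing
print(should_stop(0.20, 30, 0.10, 30))   # item limit reached first
```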

IS6. If testing continues until a specified level of measurement error
variance is attained, the average number of items necessary should be
reported as a function of ability level. Desirable.

IS7. If testing is stopped after a fixed number of items is given, the
achieved level of measurement error variance should be reported.
Essential.

This recommendation is discussed in the section on reliability.

IS8. If a hybrid rule is adopted for stopping testing, a report should be
made of both the measurement error level achieved as a function of ability
(Essential) and the average number of items used (Desirable).
Equating of Ability Scales

When initially implementing a CAT procedure, it may be necessary to
test some individuals with the previously used P&P procedure until full
implementation is achieved. It may also be desirable to compare scores on
the CAT procedure to previously developed norms. In both of these cases it
is necessary to form tables to convert the ability estimates from the CAT
procedure to the score scale from the traditional tests. The formation of
this table is called equating. If the equated ability estimates are to be
interpreted properly, the accuracy of this equating must be demonstrated.

Q1. When a paper and pencil test and a computerized test on the same
content are administered to the same person, the interpretation of the
scores must be shown to be the same. Essential.

The evaluation of the equating of test forms is a very difficult task
since there is usually no standard for comparison. Therefore, to get some
indication of quality, similarity of equating results is often used. In
the case of evaluating the equating of the CAT score scale to a paper and
pencil test, the following procedure could be used. First, two parallel
pools of test items (A and B) that have item parameters on the same scale
should be created. One group of individuals would then be administered
tests using both the paper and pencil form and the CAT procedure using
Pool A. Based on this administration an equivalent score table would be
produced using one of the many procedures available (e.g., IRT, Lord,
1981b, or equipercentile, Lord, 1981a). The same process could also be
followed using the paper and pencil test and Pool B. If the equating is
acceptable, the ability estimates determined using Pool A should be
equated to the same paper and pencil score as those obtained from Pool B.
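
A bare-bones sketch of the Pool A versus Pool B comparison, using an
unsmoothed equipercentile conversion, is given below. It is a simplified
illustration rather than the procedures cited above: operational equating
would smooth and interpolate the distributions, and the samples and names
are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(score, sample):
    """Proportion of the sample at or below the given score."""
    ordered = sorted(sample)
    return bisect_right(ordered, score) / len(ordered)


def equate(score, from_sample, to_sample):
    """Map a score to the to_sample value with the same percentile rank
    (no smoothing or interpolation)."""
    ordered = sorted(to_sample)
    rank = percentile_rank(score, from_sample)
    index = min(len(ordered) - 1, max(0, round(rank * len(ordered)) - 1))
    return ordered[index]


def largest_disagreement(thetas, cat_a, pp_a, cat_b, pp_b):
    """Largest difference in equated P&P score between the two pools."""
    return max(abs(equate(t, cat_a, pp_a) - equate(t, cat_b, pp_b))
               for t in thetas)


# Hypothetical ability estimates from each pool and P&P scores.
cat_pool_a = [-1.0, 0.0, 1.0, 2.0]
cat_pool_b = [-1.1, 0.1, 0.9, 2.1]
pp_scores = [10, 20, 30, 40]

print(equate(0.0, cat_pool_a, pp_scores))
print(largest_disagreement([-0.5, 0.5, 1.5],
                           cat_pool_a, pp_scores, cat_pool_b, pp_scores))
```

A large disagreement at any ability level would indicate that the equating
depends on the pool used and is therefore not acceptable.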

Q2. The rank order of individuals ordered by both a CAT and
paper-and-pencil instrument on the same content should be approximately the
same. Essential.

Q3. The ability estimates obtained from the CAT procedure should be
measuring the same trait as the scores from the paper and pencil test.
Essential.

Evidence for the above two recommendations can be obtained in a manner
similar to that used in the validity section of these standards. Care must
be taken to compensate for possible nonlinearity of the relationship
between raw scores and ability estimates. One procedure is to use expected
true scores rather than ability estimates, $\hat{\theta}$, in the analysis.
Another is to use scores on the equated scales.

The problem of calibrating the CAT tests needs more thorough study
than we have been able to give it. As noted in the Introduction, scaling
errors can have important effects (Department of Defense, 1980a, b).
Careful study is needed of the current calibrations of the ASVAB (Maier &
Grafton, 1981a, b).

It will be advisable to make use of a very well-established data base,
the Profile of American Youth (Department of Defense, 1982), which
established national norms on ASVAB Form 8A.

Q4. Comparative data should be available on ASVAB Form 8A and the CAT item


Human  Factors 


The CAT will be presented on some kind of computer display, and
responses will be made on some kind of keyboard. Because this method of
test administration differs markedly from the conventional P&P test,
special care should be taken to ensure that the devices and the environment
in which the test is taken are conducive to good test performance. A
variety of specific factors will be noted first, and then the cumulative
effect of the novel environment will be considered.

HF1. The environment of the testing terminal should be quiet and
comfortable, free of distractions. Essential.

It is an axiom of test administration that the environment be quiet and
comfortable. This is widely understood and is relatively easy to achieve
in a P&P mode. However, computer terminals are often set up in large rooms
that have considerable ambient noise and activity. It is important to
stress that such an environment is inappropriate. CAT requires a quiet
environment, free of distractions. A separate cubicle for each terminal
would be desirable. (A mundane point is that a pencil and scratch paper
must be available.)

It is important to note that the test could in principle be given in a
noisy, frenetic environment, so long as the same type of environment was
provided for everyone. Almost always, a noisy, frenetic environment means
lack of control over the environment, so that everyone is not tested under
the same conditions. The important criterion is fairness. Everyone must
have the same chance to succeed. Also, the quiet environment makes the
test more nearly a pure test of cognitive ability, skill or knowledge. In
a noisy environment, the test would also have a component of ability to
work in such environments. That might be an interesting facet; if so, it
should be explicitly and separately evaluated.

HF2. The display screen should be placed so that it is free from glare.
Essential.

Unless care is taken in the design of the display device and the placement
of consoles, the room lights or sunlight from nearby windows could be
reflected by the surface of the display screen, greatly reducing legibility
and increasing testing time.

One common method of reducing the possibility of glare from overhead
lights is to place the screen surface very nearly vertical. When so
tilted, the screen must then be placed relatively high off the table so
that it can easily be viewed by a seated person. The combination of tilt
and height should be watched.

To  some  extent,  equipment  design  can  help  reduce  the  chance  of  glare, 
but  eventually  this  will  be  the  responsibility  of  the  operational  personnel 
and  proctors.  Instructions  to  them  must  be  explicit  about  glare  as  well  as 
other  aspects  of  the  environment. 

HF3.  The  legibility  of  the  display  should  be  assessed  empirically. 
Desirable. 


The legibility of the display and the speed with which it can be read are
important factors. Many people are not yet accustomed to reading material
from a computer screen. The letters on the screen are less distinctive,
having less detail; the contours are also necessarily less sharp. Normally
the screen shows light characters on a dark ground, which is the reverse of
print. In any case, it will be important to check the speed and accuracy
of viewer comprehension of material on the proposed display screen,
relative to ordinary print.

HF4. The response device should be carefully designed; the display screen
should give a clear positive indication of the response selected; the
test-taker should be able to alter his response if he thinks he pushed the
wrong button. Essential.

The accuracy of response is of concern. Once the examinee has decided that
(c) is the correct answer, how much difficulty does he have in indicating
his choice to the computer, and in verifying that he indicated what he
intended? We would expect the computer terminal to have a definite
advantage here, by comparison with the usual answer sheet with bars to be
blackened. The main advantage is place-keeping. In a CAT, the examinee
cannot mark the wrong item, but can he quickly find the right button to
press (or its equivalent, with other response devices)?

Everyone makes occasional errors. The screen must provide immediate
(say within 1/2 second) feedback to the examinee about which response was
actually selected. Then some mechanism must be provided to permit changing
the response if it was in error. One possibility is to accept the response
immediately but permit the respondent to change it, if he responds within
some short time (such as three seconds). Change could be signalled by a
new response, or by pressing a separate "change" button, followed by a new
response. If the next item is ready for presentation before the
cancellation interval ends, it can either be held or its availability can
automatically terminate the cancellation interval, which would then be
variable. If the former, then the fed-back response could blink during the
cancellation interval.
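
The first possibility above (accept the response at once, but allow a change within a short cancellation interval) can be sketched as follows. This is only an illustration under stated assumptions: the names `KeyPress` and `final_response` are hypothetical, the interval is taken to be fixed at three seconds and measured from the first key press, and a change is signalled simply by a new response.

```python
from dataclasses import dataclass


@dataclass
class KeyPress:
    time_s: float  # seconds since the item was displayed
    option: str    # response option pressed, e.g. "a" through "e"


def final_response(presses, window_s=3.0):
    """Return the option that stands when the cancellation interval closes.

    The first press is accepted immediately. Any later press that arrives
    within `window_s` seconds of the first press replaces the current choice;
    presses after the interval are ignored. Returns None if nothing was pressed.
    """
    if not presses:
        return None
    ordered = sorted(presses, key=lambda p: p.time_s)
    first = ordered[0]
    chosen = first.option
    for press in ordered[1:]:
        if press.time_s - first.time_s <= window_s:
            chosen = press.option
    return chosen


# Example: "b" is pressed, corrected to "c" 1.5 s later; a press 5 s later is ignored.
presses = [KeyPress(4.0, "b"), KeyPress(5.5, "c"), KeyPress(9.0, "d")]
print(final_response(presses))
```

A variable interval (terminated when the next item becomes available) would only require passing a different `window_s` for each item.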

Another procedure would require the respondent to signal positively,
by pressing a "verify" button, that the response fed back by the system was
the response intended. This would be like the "return" key on most
computer terminals. Requiring verification requires more button pushes,
and makes the response process more complex, hence more prone to errors.
But it may save inadvertent responses, and it does require the respondent
to make sure that the recorded response was the intended response.
Empirical observations are needed to guide this decision. We tend to favor
the "verify" button, but the choice is by no means obvious.

HF5.  The  effect  of  the  response  mechanism  on  the  speeded  tests  must  be 
determined.  Essential.  (See  later  section  on  the  speeded  tests.) 

Responding  is  especially  critical  in  the  speeded  tests,  in  which  speed  of 
cognitive  functioning  is  at  issue.  Here,  responding  should  be  especially 
easy  and  compatible  with  the  item  display  format.  Again,  we  expect  the 
computer  terminal  to  be  superior  to  marking  answers  on  an  answer  sheet. 

Note here that we are only concerned with selecting one out of 4 or 5
alternatives. The possibility of a free-answer format, especially in the
numerical operations test, is intriguing. It should definitely be studied
for future use.

Part of the difficulty in choosing the response mode, discussed in the
previous section, is that a different mode may be needed for the speeded
tests. Probably verification should not be required in the speeded tests,
and correction should not be allowed. Also, the next item should appear as
soon as possible. There should be a fixed interval between the response
and its feedback to one item and the presentation of the next item. One
second might be enough; two seconds would seem to be an upper limit. Note
that no tailoring is necessary for the speeded tests.

HF6. The display must be able to include diagrams that have fine detail.
Essential.

Legibility of diagrams and line drawings is an especially vexing problem.
Here the limitations of the display may have to interact with the item
production and selection system. Some drawings may have fine detail that
is irrelevant. Others may have fine detail that is relevant, and that is
obscured on the computer display screen. It would be best if TV-quality
figures and diagrams could be used. The ordinary microcomputer terminal
has at best about 200 lines of resolution, not enough for some of the
drawings in the current versions of the ASVAB. Graphics terminals with a
resolution of at least 400 lines can produce acceptable figures.

HF7. The test proctor should be able to monitor test performance and
should be signalled automatically when irregularities occur. Very
desirable.

The test proctor should be warned by the system if an active terminal has
not produced a response in some reasonable time period. Other erratic
behavior, such as excessive responses, responses within 0.1 second of item
presentation, or similar peculiar patterns, may mean that the examinee does
not understand how to use the terminal, or it may mean that the terminal is
operating incorrectly.
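
A minimal sketch of such a warning rule is given below. The function name, the flag labels, and the 60-second idle limit are assumptions made for illustration; only the 0.1-second figure comes from the text. Detection of "excessive responses" is not covered here.

```python
def proctor_flags(latencies_s, idle_s, max_idle_s=60.0, min_latency_s=0.1):
    """Return a list of warning labels for one terminal.

    latencies_s : response latencies (seconds) for the items answered so far
    idle_s      : seconds since the current item was presented without a response
    max_idle_s  : adjustable limit on idle time (assumed default)
    min_latency_s : latencies at or below this value are treated as suspect
    """
    flags = []
    if idle_s > max_idle_s:
        flags.append("idle")
    if any(t <= min_latency_s for t in latencies_s):
        flags.append("too_fast")
    return flags


# Example: one response arrived 0.05 s after presentation.
print(proctor_flags([2.3, 0.05, 4.1], idle_s=10.0))
```

Both limits are written as parameters so that they can be tuned from experience.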

HF8. The test terminals should always be in proper working order.
Essential.

Any flaw in the terminal, such as sticky keys, may disrupt test
performance. Even on untimed tests, persons at a computer terminal usually
feel (assume) that fast response is required, and inability to respond
quickly may disconcert them. In general, a schedule of frequent, regular
maintenance should be established, to keep the terminal display clean and
in proper working order.

HF9. The terminal response system should include some additional buttons
or similar controls for future use. Desirable.

The ordinary item on a CAT will be an item that is shown all at once on the
display screen and that requires a selection of one from a few alternative
responses. But the equipment should permit constructed responses,
especially for numerical problems. Also, some future items may involve
successive displays that may optionally be shown again at the examinee's
request. Flexibility for future development is important.

Special Issues

Following are recommendations about some issues that could not be
classified above.

The Speeded Tests. Two of the ASVAB subtests, Numerical Operations
and Coding Speed, are highly speeded tests. The current theory of adaptive
testing does not apply to speeded tests. Item response theory assumes a
power test, and would prefer that every respondent answer every item
presented to him. Thus the speeded ASVAB tests cannot now be made
adaptive.

However, the speeded tests can and should be administered by computer
in the CAT environment. Here the particular design of the display and
response devices will play a role in the difficulty of the test. Very
likely the computer version will permit students to work faster than the
paper-and-pencil version, because a keyboard response is probably faster
and less prone to error than marking an answer sheet. It is hoped that the
amount of difference between the computer and the paper-and-pencil versions
will be constant for all test-takers, but this must be checked. Actually,
a constant difference for all test-takers is unlikely. Most of the reasons
for a difference, such as legibility, difficulty of place-keeping on the
answer sheet, etc., tend to apply at the item level; there is more likely to
be either a constant difference per item, or a proportional difference per
item. In either case there would then be a slight change in the test score
distribution for the higher scores; this is not likely to be a serious
problem, but again it should be checked.

U1. The computing system must be carefully designed for the speeded tests
so that the system itself adds no variability in testing time. Essential.

Each new item should be presented a fixed time after the respondent presses
the response button. This fixed time should be as short as possible
consistent with the requirement that it be essentially constant. An
inter-item time of one second would seem an upper limit, and 500
milliseconds might be better. (As a system specification, the fixed time
interval should have some tolerance level, such as ±5%.)
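
A specification of this kind can be checked directly against logged inter-item times. The sketch below assumes a nominal interval of 500 milliseconds and a tolerance of 5 percent, as suggested above; the function name and the logging format (a list of measured intervals in milliseconds) are hypothetical.

```python
def interval_violations(intervals_ms, nominal_ms=500.0, tolerance=0.05):
    """Return the positions of inter-item intervals outside the tolerance band.

    An interval is a violation when it differs from `nominal_ms` by more than
    `tolerance` times `nominal_ms` (here, more than 25 ms either way).
    """
    allowed = tolerance * nominal_ms
    return [i for i, ms in enumerate(intervals_ms)
            if abs(ms - nominal_ms) > allowed]


# Example: the third and fourth intervals fall outside 475-525 ms.
print(interval_violations([500.0, 510.0, 530.0, 470.0, 495.0]))
```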

U2. The calibration tables for the speeded tests must be prepared using
data from the operational equipment. Essential.

The equipment will have a main effect even if it doesn't contribute to the
measurement error variance. Since the score on these tests is essentially
the number of items correct in a fixed time interval, the speed of reading
the display and using the response mechanism is a part of the score. Thus
calibration must involve the actual equipment to be used.

One obvious implication is that special scoring problems will arise if
the computer display and response equipment is not identical for all
test-takers. If more than one kind of equipment is introduced, separate
score conversions will have to be worked out for each type of equipment.
Thus whenever a new and better model of test terminal is introduced, the
speeded tests will have to be recalibrated. Also, norms should be developed
for alternative equipment. In case of mobilization, for example, there may
be a need to use standard computer terminals, somehow. This creates a
severe problem for the tests with diagrams (a booklet of diagrams might
have to be provided), and it also creates a severe problem for the norms of
the speeded tests. It may be that most standard terminals can be made
enough alike by overlays on the keyboard, so that only one or two
alternative calibrations would be needed. And it might be that equipment
differences are too small to require different norms, but this cannot be
assumed. Empirical evidence is needed.

U3. The equipment should permit recording the time between item
presentation and item response, for each item for each respondent. Highly
desirable.

Eventually, we would hope that use could be made of response times to
individual items. Research will be needed on this topic. Probably time to
the nearest 1/60 second would be sufficient, but time to the nearest
millisecond might be handy. Note that the response time itself may be as
small as 0.5 second.
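
The practical difference between the two timing resolutions can be worked out directly. The sketch below rounds a latency to the nearest clock tick and gives the worst-case rounding error (half a tick) as a fraction of the response time; the function names are hypothetical.

```python
def quantize(time_s, ticks_per_second=60):
    """Round a latency to the nearest clock tick (1/60 s by default)."""
    return round(time_s * ticks_per_second) / ticks_per_second


def max_relative_error(response_time_s, ticks_per_second=60):
    """Worst-case rounding error (half a tick) as a fraction of the latency."""
    half_tick = 1.0 / (2 * ticks_per_second)
    return half_tick / response_time_s


# For a 0.5 s response, a 1/60 s clock can be off by about 1.7 percent,
# while a millisecond clock can be off by 0.1 percent.
print(max_relative_error(0.5, 60))
print(max_relative_error(0.5, 1000))
```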

We have assumed here that the items will be presented serially, one at
a time. This would be standard practice in a psychological laboratory.
Someone has suggested that several items be displayed at once, the display
changing to a new batch when all of the first set have been answered; the
score would be the time to respond to a fixed set of items, or the total
number of correct items in a fixed time period. Although this format might
be more nearly like the P&P test, it retains some of the placekeeping
nature of the P&P test, which is a procedural confound. The time wanted is
only the time to do the cognitive operations. If placekeeping is to be
tested, it should be done separately. Further, all the other tests will
have been presented one item at a time and a screenful of items would be
confusing to the test-taker.


Omits 


Expert opinion differs concerning whether test-takers should be
permitted to omit an item on a CAT. If the test-taker doesn't understand
the item, forcing a response may mean forcing a guess, which adds error.
On the other hand, the item has in principle been selected as the most
informative item about this person's ability. It would be unfortunate to
lose the utility of this very informative item.

If omitting is permitted, should the omitted item be replaced by one
of equivalent difficulty, or by a slightly easier item? Omitting means, at
least in part, a failure, so a slightly easier item would seem appropriate.
Merely presenting an easier item represents a slight penalty in a short
tailored test.
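
If the "slightly easier item" option were adopted, the replacement rule might look like the following sketch. It is an illustration only: the function name, the representation of the pool as (item, difficulty) pairs, and the step of 0.25 difficulty units are all assumptions, not values proposed in this report.

```python
def replacement_for_omit(omitted_b, pool, step=0.25):
    """Pick a replacement for an omitted item of difficulty `omitted_b`.

    pool : list of (item_id, difficulty) pairs not yet administered
    step : how much easier the replacement should ideally be (assumed value)

    Prefers items no harder than the omitted one and returns the item whose
    difficulty is closest to `omitted_b - step`; falls back to the whole pool
    if no easier item remains. Returns None for an empty pool.
    """
    target = omitted_b - step
    easier = [(item_id, b) for item_id, b in pool if b <= omitted_b]
    candidates = easier or pool
    if not candidates:
        return None
    return min(candidates, key=lambda item: abs(item[1] - target))[0]


pool = [("q1", 0.9), ("q2", 0.3), ("q3", 0.55), ("q4", -1.0)]
print(replacement_for_omit(0.6, pool))
```

Setting `step=0` gives the alternative of replacing the omitted item by one of equivalent difficulty.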

If the respondents are permitted to omit items then the best
psychometric procedure would involve the use of a graded response model.
It is often found that an omit deserves more credit than an incorrect
response. However, it would be very difficult to accumulate enough data
through spontaneous omitting of items. Also not much experience has yet
accrued concerning graded response models. Thus at present it is difficult
to determine what to do with omitted responses.

U4. For the present, omits should not be permitted. Desirable.

This seems the most defensible alternative, psychometrically. However, we
are not comfortable with this recommendation, and strongly urge additional
research on various alternatives. Graded response models are practical
with a computer, and should be explored, with and without omitting.


Item  Bias 


It is important that insofar as possible, the items on the test should
not be offensive to any group of persons, nor should they favor any one
group more than another, apart from the ability being tested. Of course,
one group may actually surpass another on some ability. Men, for example,
may score higher than women on shop knowledge because on the average they
know more about shop. But even when such overall group differences are
controlled statistically, some items may show a group difference for other,
irrelevant reasons. Such items would be considered biased.

U5. All potential items for the CAT item pools should be screened in an
attempt to identify and discard items that are offensive to ethnic groups,
or to women, or men. Items should be screened by judges qualified to
identify such biased content. (Essential)

U6.  For  each  item  pool,  statistical  studies  should  be  done  of  item  bias, 
by  comparing  subgroup  performance  on  the  items.  (Highly  desirable). 

Several  statistical  procedures  have  been  used,  and  no  one  is  generally 
believed  to  be  better  than  another.  With  conventional  item  analysis,  the 
simplest  procedure  is  to  determine  difficulty  (equated  delta  plots)  for 
identifiable  subgroups.  Separate  analyses  should  be  done  comparing  Whites 
and  Blacks,  and  comparing  men  and  women.  With  IRT,  a  straightforward 
procedure  is  to  obtain  item  parameters  separately  for  the  two  groups,  and 
to  compare  the  item  response  curves  visually,  as  well  as  statistically. 
Because  the  groups  may  have  large  average  differences,  an  analysis  should 
also  be  done  with  a  sample  from  the  majority  population  that  has  the  same 
score  distribution  as  the  minority  groups,  generating  the  IRC's  from  groups 
of  comparable  average  ability.  Methods  for  studying  item  bias  have  been 
discussed  by  Shepard  (1981),  Levine  (1982),  and  in  Berk  (1982). 
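
As an illustration of the conventional (delta plot) approach, the sketch below converts proportions correct to the delta metric, fits the major axis of the scatter of one group's deltas against the other's, and returns each item's perpendicular distance from that line. The delta transformation (13 plus 4 times the normal deviate of the proportion failing) and the major-axis fit are the usual textbook forms and are assumed here rather than taken from this report; the function names are hypothetical. Items lying far from the line would be candidates for closer review.

```python
from math import sqrt
from statistics import NormalDist, mean


def delta(p_correct):
    """Delta index of difficulty: 13 at p = 0.5, larger for harder items.

    Requires 0 < p_correct < 1.
    """
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)


def major_axis_distances(deltas_a, deltas_b):
    """Signed perpendicular distance of each item from the major axis.

    deltas_a, deltas_b : item deltas for the two groups, in the same item order.
    Assumes the two sets of deltas are correlated (nonzero cross-product).
    """
    mx, my = mean(deltas_a), mean(deltas_b)
    sxx = sum((x - mx) ** 2 for x in deltas_a)
    syy = sum((y - my) ** 2 for y in deltas_b)
    sxy = sum((x - mx) * (y - my) for x, y in zip(deltas_a, deltas_b))
    slope = (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    norm = sqrt(slope ** 2 + 1)
    return [(slope * x - y + intercept) / norm
            for x, y in zip(deltas_a, deltas_b)]


# Example with made-up deltas: the third item is relatively much harder for group B.
group_a = [9.0, 11.0, 13.0, 15.0, 17.0]
group_b = [10.0, 12.0, 17.0, 16.0, 18.0]
distances = major_axis_distances(group_a, group_b)
print(max(range(len(distances)), key=lambda i: abs(distances[i])))
```

A uniform difference between groups (every item harder by the same amount) leaves all distances at zero, which is the sense in which overall group differences are set aside.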

Because such studies have recently been made of one current form of
the ASVAB (Bock & Mislevy, 1981), the need for item bias studies is not
pressing. They should be done, but can be done after more critical
problems have been solved. (See also Wing, 1980.)

We note that statistical studies of item bias seldom find evidence of
item bias. (See, for example, Linn et al., 1981.) Apparently, the
screening process usually works quite well, so no blatantly biased items
are missed. Items identified by statistical methods as possibly biased are
frequently baffling, in the sense that nothing in their content seems at
all likely to lead to unusual group differences. This does not mean that
statistical studies should not be made, but that sometimes the results of
such studies are difficult to interpret.

Such  studies  are  also  difficult  for  operational  reasons.  There  may  be 
relatively  few  cases  in  a  minority  group.  It  may  be  necessary  to  restrict 
the  study  to  a  subset  of  items,  or  else  the  study  may  have  to  wait  until 
the  requisite  cases  have  been  accumulated. 


IV.  Procedural  Details  and  Future  Prospects 


Comments  on  the  Procedures 


A variety of special procedures must be adopted to permit a computer
to conduct a testing session, as well as to store the CAT item pool in the
first place. A method is needed to estimate the item parameters (£, £).
An algorithm is needed to tailor the test, and to score each respondent.
Specific decisions must be made about many details, including balancing
item options, choosing a starting value for each person on each test, how
to present the instructions, and management of rest periods.
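
To make concrete what a tailoring algorithm involves, the sketch below selects the next item by maximum information at the current ability estimate. This is one common approach and is an assumption here, not the procedure specified for the CAT-ASVAB; the three-parameter logistic form, the scaling constant D = 1.7, and all function and variable names are likewise illustrative.

```python
from math import exp


def p_correct(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


def information(theta, a, b, c, D=1.7):
    """Item information at ability `theta` under the same model."""
    p = p_correct(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def next_item(theta, pool, administered):
    """Return the unused item with the most information at `theta`.

    pool : dict mapping item id to an (a, b, c) tuple
    administered : set of item ids already presented
    """
    candidates = {k: v for k, v in pool.items() if k not in administered}
    if not candidates:
        return None
    return max(candidates, key=lambda k: information(theta, *candidates[k]))


pool = {"easy": (1.0, -2.0, 0.0), "mid": (1.0, 0.1, 0.0), "hard": (1.0, 2.5, 0.0)}
print(next_item(0.0, pool, administered=set()))
```

Whatever algorithm is actually adopted, writing it down at this level of detail is what allows its behavior to be reproduced and evaluated.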

P1. All procedures must be documented and described in enough detail so
that the procedures could be reproduced on another computer from the
documentation alone. (Essential)

The importance of explicit documentation cannot be overemphasized.
Evaluating the psychometric quality of the test may depend on knowing some
details of the procedure. For example, the manner of estimating item
parameters is critical to the equating process. Future evaluations and
research projects may require knowing certain parts of the procedure.
Certainly the project must never be in the position of viewing the computer
as a mysterious black box with a mind of its own. It should never be
necessary to attribute a result to some inexplicable decision of the
computer.

The details of the administration of the CAT are not in the province
of this committee, except as they affect the evaluation. Still, the
committee wishes to provide some comments on the procedures that should be
considered by those in charge of administration.

1. Rest periods. The current ASVAB requires about 2 1/2 hours plus rest
breaks and instructions. Presumably the CAT version will reduce this to
about 1 1/4 to 1 1/2 hours. Test-takers will have to be given a short rest
break, at least once or twice in the total testing session. Continuous
interaction with a computer display can be very tiring. The system must be
programmed to provide these breaks, and to respond to some kind of signal
that the break is finished. The system must also be able to accommodate
unscheduled breaks, in emergency situations.

Almost certainly, some test-takers will complain of eye strain as a
result of watching the display screen more or less constantly for
about 90 minutes or more. Probably the eye strain is an excuse for the more
fundamental problem of cognitive strain. The test will seem moderately
difficult to everyone, since ideally they will only get about two-thirds of
the items correct. Perhaps some initial warning to that effect could be
provided. Still, one or two breaks between tests would be advisable.

2. Instructions and sample items. The mode of presenting and checking
instructions for each test deserves careful attention. What if the
examinee answers a sample item incorrectly? Probably he should be forced
to continue to respond until he gets it correct. But should an easier
sample item then be displayed, the process continuing until he can answer
one of the sample items correctly on the first try? Should the test
supervisor be called to check the terminal? Perhaps the examinee doesn't
understand the use of the response buttons. These issues deserve careful
attention and planning.

Should the examinee be permitted to review the instructions, once he
has started on the tests? Probably so, but this may be subject to revision
if the examinees overuse the option. This creates the problem of
designing a means for the examinee to review the instructions and to cancel
the review.

The examinee will not be permitted to return to a previous item once
the next item has been presented. In the P&P version, of course, the
examinee may erase, go back, and reconsider ad lib, probably to his
detriment.

3. Response time of terminal. The computer should respond within one
second to test-taker responses and key presses. A few seconds may be
tolerable between successive items on unspeeded tests, though this time
should be kept as short as possible. A design objective of two seconds
should be established. Three seconds between items might be tolerable, but
five seconds would seem unreasonable. Shorter response times are needed
for the speeded tests.

4. Monitoring quality. A maximum elapsed time should be established
between the presentation of an item to the person and the receipt of a
response from the person. This maximum interval may be one minute, or may
be 30 seconds; it should certainly be an adjustable system parameter.
Experience will dictate the best setting; this default is necessary to
guard against a test-taker who is not alert, or a terminal that is not
working properly.

The introduction of a computerized adaptive version of the ASVAB will
represent the first large-scale use of CAT. The recommendations in this
report are designed to insure that CAT provides the expected substantial
increase in efficiency with no loss in quality. The procedure will need to
be monitored to insure that it is operating in the ways that are expected.
Also, the present efforts have mainly been toward establishing an
operational test. Maintenance of the testing procedure on an operational
basis will require added attention. New items will have to be added from
time to time, and old items will have to be retired. The maintenance and
evaluation of the new version of the test will require extended attention
from research and development specialists.


It will be necessary to check and refine the psychometric procedures
employed in CAT. Present technology has been developed in the absence of
extensive experience with CAT. As experience accrues, procedures must
certainly be monitored, and may well require some modification. A number
of theoretical and practical questions remain to be answered. For example,
how sensitive is the process to biased estimates of item parameters? Is
the item selection algorithm working as expected? If omitting is
permitted, is it widespread? Is there evidence of inappropriate responses
(low scoring candidates getting difficult items correct, or high scoring
candidates missing easy items)? Is the equating and norming of scores
satisfactory? Statistical methods are needed for assessing the
unidimensionality of item banks. In general, statistical methods are
needed for analyzing the kind of item response data that emerges from the
CAT.
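
A very simple screen for inappropriate responses, of the kind described in the parenthesis above, is sketched here. It merely counts correct answers to items far above the final ability estimate and wrong answers to items far below it. The function name and the margin of 1.5 difficulty units are assumptions for illustration; this is not a formal appropriateness index.

```python
def unexpected_responses(theta, responses, margin=1.5):
    """Count surprising responses for one examinee.

    theta     : final ability estimate
    responses : list of (item_difficulty, correct) pairs
    margin    : how far from theta an item must be to count (assumed value)

    Returns (hard_items_correct, easy_items_missed).
    """
    hard_right = sum(1 for b, ok in responses if ok and b - theta > margin)
    easy_wrong = sum(1 for b, ok in responses if not ok and theta - b > margin)
    return hard_right, easy_wrong


record = [(2.0, True), (-2.0, False), (0.3, True), (1.0, False)]
print(unexpected_responses(0.0, record))
```

Because a tailored test concentrates items near the examinee's estimated ability, large gaps between item difficulty and ability should be uncommon late in the test, so even small counts may deserve attention.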


Throughout this report we have identified other issues that require
experimental study. Many choices have had to be based on judgment, rather
than evidence. Studies should be done on all these issues, so that
knowledge can replace opinion, to provide a solid basis for CAT procedures.
Research is needed in every area: dimensionality, reliability, validity,
item parameter estimation and linking, item pool characteristics, item
selection, ability estimation, and scale calibration. An extensive program
of psychometric research is a necessary adjunct of a CAT system.

In addition to these technical matters, there are many opportunities
for CAT to make fundamental improvements in the personnel selection and
classification system. The introduction of the computer-administered
adaptive version of the ASVAB has great potential for improved personnel
assessment in the Armed Forces, quite apart from the immediate savings
realized in the recruit testing process. To realize this potential,
further research and development projects are needed to develop the most
promising possibilities.


The most likely benefits from CAT are improved measurements of
abilities now included in the ASVAB and the addition or change of abilities
now measured. The present ASVAB is well designed, but the tests in the
battery are intercorrelated to an extent that detracts from potential
validity. That is, the present tests are not sufficiently distinct.
Validity is likely to be improved if tests measuring different aspects of
ability are included. Validity of the ASVAB for predicting performance in
various technical schools is adequate, but modest. There is much room for
improvement. Any improvement in validity translates directly to economic
benefits in reducing dropout and failure rates. Both the services and the
individual recruit lose if the recruit is placed in a specialty for which
he or she is not qualified.

Various possibilities can be explored for obtaining more information
from the present test. Additional measures, such as reaction time, can be
taken (Micko, 1969; Thissen, 1980). The items can be presented in a format
that requires the recruit to continue choosing answer options until he hits
the correct answer. Free answer formats can be tried, at least on the test
of arithmetic operations. It is by no means obvious how to use this
additional information, so considerable exploration will be required.

A better way to get more information is to use new item types. One is
already being studied by McBride (1980). He is altering the presentation
of reading comprehension items so that the passage appears first. Then,
when the respondent is ready, the passage disappears and is replaced by the
question about the passage. This makes the test sufficiently different
from the current version that it would not be wise to include it in the
original CAT version of the ASVAB. It has a memory component that may make
the test more distinctive, and hence not strictly comparable with the
parallel P&P mode. It is, however, a goal for the future, since it may
be more valid for many uses.

The best way to get more information is to test additional aptitudes
and skills. Other tests or test items include spatial visualization, not
now a part of ASVAB, items with moving parts for mechanical ability tests,
judgments about collision of two or more moving elements, discrimination of
temporal intervals, and more. The list is endless. Although some of these
possibilities will not prove to be useful, others surely will be
sufficiently promising that they would considerably improve the
predictability of recruit performance.

The relationship and possible contribution of computerized adaptive
testing to the process of counselling and placement procedures need study.
Instead of asking, "Will this particular recruit pass School A? School
B?", the Armed Services will increasingly be asking, "How can we best use
this recruit?" (Or from the recruit's perspective, "How can this recruit
best realize his or her potential?") This implies using measures like the
ASVAB for placement, rather than selection. Over the years, the Armed
Services have studied the placement problem, but there has been little
attempt to design ASVAB with an eye on placement rather than selection. In
a CAT environment, in addition to new tests, there are other possibilities.
Possibly everyone need not be tested on all variables. The counselling
opportunities can be explored in the context of existing work on placement
systems in the Air Force (Hendrix, Ward, Pina, & Harvey, 1979), and the
Navy (Horst & Sorensen, 1976).


The many possibilities for research and development represent
opportunities to realize additional economic benefit from a CAT system.
The operational benefits of adopting CAT are manifest; the possibilities
mentioned here are ways of getting added value from the investment.
University  of  Colorado 
Boulder,  CO  80309 

1  Col  Ray  Bowles 
800  N.  Quincy  St. 

Room  804 

Arlington.  VA  22217 

1  Dr.  Robert  Brennan 

American  College  Testing  Programs 

P.  0.  Box  168 

Iowa  City.  TA  52240 

1  DR.  C.  VICTOR  BUNDERSON 
WICAT  INC. 

UNIVERSITY  PLAZA.  SUITE  10 
1160  SO.  STATE  ST. 

OREM,  UT  84057 

1  Dr.  John  B.  Carroll 
Psychometric  Lab 
Univ.  of  No.  Carolina 
Davie  Hall  013A 
Chapel  Hill,  NC  27514 

1  Charles  Myers  Library 
Livingstone  House 
Livingstone  Road 
Stratford 
London  E15  2LJ 
ENGLAND 

1  Dr.  Kenneth  E.  Clark 

College  of  Arts  A  Sciences 
University  of  Rochester 
River  Campus  Station 
Rochester,  NY  14627 

1  Dr.  Norman  Cliff 
Dept,  of  Psychology 
Univ.  of  So.  California 
University  Park 
Los  Angeles,  CA  90007 
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1  Dr.  Deborah  Coates 
Catholic  University 
620  Michigan  Ave.  NE 
Washington,  DC  20064 

1  Dr.  William  E.  Coffhian 

Director,  Iowa  Testing  Programs 
334  Lindquist  Center 
University  of  Iowa 
Iowa  City,  lA  52242 

1  Dr.  Meredith  P.  Crawford 

American  Psychological  Association 
1200  17th  Street,  N.W. 

Washington,  DC  20036 

1  Dr.  Hans  Crombag 

Education  Research  Center 
University  of  Leyden 
Boerhaavelaan  2 
2334  EM  Leyden 
The  NETHERLANDS 

1  Director 

Behavioural  Sciences  Division 
Defence  A  Civil  Institute  of 
Environmental  Medicine 
Post  Office  Box  2000 
Downsvlew,  Ontario  M3M  3B9 
CANADA 

1  Dr., Fritz  Drasgow 

Yale  School  of  Ort..nizatlon  and  Manageme 
Yale  University 
Box  1A 

New  Haven,  CT  06520 

1  Dr.  Mavln  D.  Dunnette 

Personnel  Decisions  Research  Institute 
2415  Foshay  Tower 
821  Marguette  Avenue 
Mlneapolls,  NN  55402 

1  Mike  Durmeyer 

Instructional  Program  Development. 

Building  90 

MET-PDCD 

Great  Lakes  NTC,  IL  60088 


1  ERIC  Facility-Acquisitions 
4833  Rugby  Avenue 
Bethesda,  MD  20014 

1  Dr.  A.  J.  Eschenbrenner 
Dept.  E422,  Bldg.  81 
McDonnell  Douglas  Astronautics  Co. 
P.O.Box  516 
St.  Louis,  MO  63166 

1  Dr.  Benjamin  A.  Fairbank,  Jr. 
McFann-Gray  &  Associates,  Inc. 

5825  Callaghan 
Suite  225 

San  Antonio,  Texas  78228 

1  Dr.  Leonard  Feldt 

Lindquist  Center  for  Measurment 
University  of  Iowa 
Iowa  City,  lA  52242 

1  Dr.  Richard  L.  Ferguson 

The  American  College  Testing  Program 

P.O.  Box  168 

Iowa  City,  IA  52240 

1  Dr.  Victor  Fields 
Dept.  of  Psychology 
Montgomery  College 
Rockville,  MD  20850 

1  Univ.  Prof.  Dr.  Gerhard  Fischer 
Liebiggasse  5/3 
A  1010  Vienna 
AUSTRIA 

1  Professor  Donald  Fitzgerald 
University  of  New  England 
Armidale,  New  South  Wales  2351 
AUSTRALIA 

1  Dr.  John  R.  Frederiksen 
Bolt  Beranek  &  Newman 
50  Moulton  Street 
Cambridge,  MA  02138 

1  DR.  ROBERT  GLASER 
LRDC 

UNIVERSITY  OF  PITTSBURGH 
3939  O'HARA  STREET 
PITTSBURGH,  PA  15213 

1  Dr.  Frank  E.  Gomer 

McDonnell  Douglas  Astronautics  Co. 

P.O.  Box  516 
St.  Louis,  MO  63166 

1  Dr.  Daniel  Gopher 

Industrial  &  Management  Engineering 
Technion-Israel  Institute  of  Technology 
Haifa 
ISRAEL 

1  Dr.  Bert  Green 

Johns  Hopkins  University 
Department  of  Psychology 
Charles  &  34th  Street 
Baltimore,  MD  21218 

1  Dr.  Ron  Hambleton 
School  of  Education 
University  of  Massachusetts 
Amherst,  MA  01002 

1  Dr.  Delwyn  Harnisch 
University  of  Illinois 
242b  Education 
Urbana,  IL  61801 

1  Dr.  Chester  Harris 
School  of  Education 
University  of  California 
Santa  Barbara,  CA  93106 

1  Dr.  Lloyd  Humphreys 

Department  of  Psychology 
University  of  Illinois 
Champaign,  IL  61820 

1  Library 

HumRRO/Western  Division 
27857  Berwick  Drive 
Carmel,  CA  93921 


1  Dr.  Steven  Hunka 

Department  of  Education 
University  of  Alberta 
Edmonton,  Alberta 
CANADA 

1  Dr.  Earl  Hunt 

Dept.  of  Psychology 
University  of  Washington 
Seattle,  WA  98105 

1  Dr.  Jack  Hunter 
2122  Coolidge  St. 

Lansing,  MI  48906 

1  Dr.  Huynh  Huynh 

College  of  Education 
University  of  South  Carolina 
Columbia,  SC  29208 

1  Professor  John  A.  Keats 
University  of  Newcastle 
AUSTRALIA  2308 

1  Mr.  Jeff  Kelety 

Department  of  Instructional  Technology 
University  of  Southern  California 
Los  Angeles,  CA  90007 

1  Dr.  Michael  Levine 

Department  of  Educational  Psychology 
210  Education  Bldg. 

University  of  Illinois 
Champaign,  IL  61801 

1  Dr.  Charles  Lewis 

Faculteit  Sociale  Wetenschappen 
Rijksuniversiteit  Groningen 
Oude  Boteringestraat  23 
9712GC  Groningen 
Netherlands 

1  Dr.  Robert  Linn 

College  of  Education 
University  of  Illinois 
Urbana,  IL  61801 

1  Dr.  Frederick  M.  Lord 

Educational  Testing  Service 
Princeton,  NJ  08540 

1  Dr.  James  Lumsden 

Department  of  Psychology 
University  of  Western  Australia 
Nedlands  W.A.  6009 
AUSTRALIA 

1  Mr.  Merl  Malehorn 
Dept.  of  Navy 
Chief  of  Naval  Operations 
OP-113 

Washington,  DC  20350 

1  Dr.  Gary  Marco 

Educational  Testing  Service 
Princeton,  NJ  08540 

1  Dr.  Scott  Maxwell 

Department  of  Psychology 
University  of  Houston 
Houston,  TX  77004 

1  Dr.  Samuel  T.  Mayo 

Loyola  University  of  Chicago 
820  North  Michigan  Avenue 
Chicago,  IL  60611 

1  Professor  Jason  Millman 
Department  of  Education 
Stone  Hall 
Cornell  University 
Ithaca,  NY  14853 

1  Bill  Nordbrock 

Instructional  Program  Development 

Building  90 

NET-PDCD 

Great  Lakes  NTC,  IL  60088 

1  Dr.  Melvin  R.  Novick 

356  Lindquist  Center  for  Measurement 
University  of  Iowa 
Iowa  City,  IA  52242 


1  Dr.  Jesse  Orlansky 

Institute  for  Defense  Analyses 
400  Army  Navy  Drive 
Arlington,  VA  22202 

1  Wayne  M.  Patience 

American  Council  on  Education 
GED  Testing  Service,  Suite  20 
One  Dupont  Circle,  NW 
Washington,  DC  20036 

1  Dr.  James  A.  Paulson 

Portland  State  University 
P.O.  Box  751 
Portland,  OR  97207 

1  MR.  LUIGI  PETRULLO 

2431  N.  EDGEWOOD  STREET 
ARLINGTON,  VA  22207 

1  Dr.  Richard  A.  Pollak 

Director,  Special  Projects 
Minnesota  Educational  Computing  Consortium 
2520  Broadway  Drive 
St.  Paul,  MN  55113 

1  DR.  DIANE  M.  RAMSEY-KLEE 

R-K  RESEARCH  &  SYSTEM  DESIGN 
3947  RIDGEMONT  DRIVE 
MALIBU,  CA  90265 

1  MINRAT  M.  L.  RAUCH 
P  II  4 

BUNDESMINISTERIUM  DER  VERTEIDIGUNG 

POSTFACH  1328 

D-53  BONN  1,  GERMANY 

1  Dr.  Mark  D.  Reckase 

Educational  Psychology  Dept. 

University  of  Missouri-Columbia 
4  Hill  Hall 
Columbia,  MO  65201 

1  Dr.  Leonard  L.  Rosenbaum,  Chairman 
Department  of  Psychology 
Montgomery  College 
Rockville,  MD  20850 

1  Dr.  Lawrence  Rudner 
403  Elm  Avenue 
Takoma  Park,  MD  20012 

1  Dr.  J.  Ryan 

Department  of  Education 
University  of  South  Carolina 
Columbia,  SC  29208 

1  PROF.  FUMIKO  SAMEJIMA 
DEPT.  OF  PSYCHOLOGY 
UNIVERSITY  OF  TENNESSEE 
KNOXVILLE,  TN  37916 

1  Frank  L.  Schmidt 

Department  of  Psychology 
Bldg.  GG 

George  Washington  University 
Washington,  DC  20052 

1  Dr.  Kazuo  Shigemasu 
University  of  Tohoku 
Department  of  Educational  Psychology 
Kawauchi,  Sendai  980 
JAPAN 

1  Dr.  Edwin  Shirkey 

Department  of  Psychology 
University  of  Central  Florida 
Orlando,  FL  32816 

1  Dr.  Richard  Snow 
School  of  Education 
Stanford  University 
Stanford,  CA  94305 

1  Dr.  Robert  Sternberg 
Dept.  of  Psychology 
Yale  University 
Box  11A,  Yale  Station 
New  Haven,  CT  06520 

1  Dr.  Thomas  G.  Sticht 

Director,  Basic  Skills  Division 
HumRRO 

300  N.  Washington  Street 
Alexandria,  VA  22314 


1  DR.  PATRICK  SUPPES 

INSTITUTE  FOR  MATHEMATICAL  STUDIES  IN 
THE  SOCIAL  SCIENCES 
STANFORD  UNIVERSITY 
STANFORD,  CA  94305 

1  Dr.  Hariharan  Swaminathan 

Laboratory  of  Psychometric  and 
Evaluation  Research 
School  of  Education 
University  of  Massachusetts 
Amherst,  MA  01003 

1  Dr.  Brad  Sympson 

Psychometric  Research  Group 
Educational  Testing  Service 
Princeton,  NJ  08541 

1  Dr.  Kikumi  Tatsuoka 

Computer  Based  Education  Research 
Laboratory 

252  Engineering  Research  Laboratory 
University  of  Illinois 
Urbana,  IL  61801 

1  Dr.  David  Thissen 

Department  of  Psychology 
University  of  Kansas 
Lawrence,  KS  66044 

1  Dr.  Douglas  Towne 

Univ.  of  So.  California 
Behavioral  Technology  Labs 
1845  S.  Elena  Ave. 

Redondo  Beach,  CA  90277 

1  Dr.  Robert  Tsutakawa 

Department  of  Statistics 
University  of  Missouri 
Columbia,  MO  65201 

1  Dr.  J.  Uhlaner 

Perceptronics,  Inc. 

6271  Variel  Avenue 
Woodland  Hills,  CA  91364 


HOPKINS/GREEN  April  19,  1982 



1  Dr.  David  Vale 

Assessment  Systems  Corporation 
2395  University  Avenue 
Suite  306 

St.  Paul,  MN  55114 

1  Dr.  Howard  Wainer 

Division  of  Psychological  Studies 
Educational  Testing  Service 
Princeton,  NJ  08540 

1  Dr.  David  J.  Weiss 
N660  Elliott  Hall 
University  of  Minnesota 
75  E.  River  Road 
Minneapolis,  MN  55455 

1  DR.  GERSHON  WELTMAN 
PERCEPTRONICS  INC. 

6271  VARIEL  AVE. 

WOODLAND  HILLS,  CA  91367 

1  Wolfgang  Wildgrube 
Streitkraefteamt 
Box  20  50  03 
D-5300  Bonn  2 


